PaperHub
Overall: 7.8/10 · Poster · 4 reviewers
Ratings: 5, 4, 5, 5 (min 4, max 5, std 0.4)
Confidence: 3.5 · Novelty: 3.0 · Quality: 3.0 · Clarity: 2.8 · Significance: 3.3
NeurIPS 2025

VarFlow: Proper Scoring-Rule Diffusion Distillation via Energy Matching

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

**Diffusion models** achieve remarkable generative performance but are hampered by slow, iterative inference. Model distillation seeks to train a fast student generator. **Variational Score Distillation (VSD)** offers a principled KL-divergence minimization framework for this task. This method cleverly avoids computing the teacher model's Jacobian, but its student gradient relies on the score of the student's own noisy marginal distribution, $\nabla_{\mathbf{x}_t} \log p_{\phi,t}(\mathbf{x}_t)$. VSD thus requires approximations, such as training an auxiliary network to estimate this score. These approximations can introduce biases, cause training instability, or lead to an incomplete match of the target distribution, potentially focusing on conditional means rather than broader distributional features. We introduce **VarFlow**, a method based on a **Score-Rule Variational Distillation (SRVD)** framework. VarFlow trains a one-step generator $g_{\phi}(\mathbf{z})$ by directly minimizing an energy distance (derived from the strictly proper energy score) between the student's induced noisy data distribution $p_{\phi,t}(\mathbf{x}_t)$ and the teacher's target noisy distribution $q_t(\mathbf{x}_t)$. This objective is estimated entirely using samples from these two distributions. Crucially, VarFlow bypasses the need to compute or approximate the intractable student score. By directly matching the full noisy marginal distributions, VarFlow aims for a more comprehensive and robust alignment between student and teacher, offering an efficient and theoretically grounded path to high-fidelity one-step generation.
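To make the sample-based objective concrete, below is a minimal PyTorch-style sketch of a VarFlow-like training step, assuming a VE-style forward process $x_t = x_0 + \sigma_t \epsilon$; names such as `g_phi` and `latent_dim` are placeholders, and this is an illustration of the idea described in the abstract rather than the authors' implementation.

```python
import torch

def energy_distance(x, y):
    """Sample-based energy distance 2*E||X-Y|| - E||X-X'|| - E||Y-Y'||.

    Simple V-statistic form: the self-distance terms include i == j pairs,
    which adds a small bias that is acceptable for illustration.
    """
    x, y = x.flatten(1), y.flatten(1)
    return (2 * torch.cdist(x, y).mean()
            - torch.cdist(x, x).mean()
            - torch.cdist(y, y).mean())

def varflow_like_step(g_phi, x_teacher, sigma_t, latent_dim):
    """One illustrative training step: match noisy marginals at noise level sigma_t."""
    K = x_teacher.shape[0]
    z = torch.randn(K, latent_dim, device=x_teacher.device)
    x_student = g_phi(z)                                        # one-step generation
    xt_student = x_student + sigma_t * torch.randn_like(x_student)
    xt_teacher = x_teacher + sigma_t * torch.randn_like(x_teacher)
    return energy_distance(xt_student, xt_teacher)              # gradients reach g_phi via x_student
```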
Keywords
Diffusion · Scoring Rule · Distillation · VSD

Reviews and Discussion

Review
Rating: 5
  • Focuses on distilling diffusion teacher models to efficient single-step student models
  • Identifies a limitation of previous variational distillation methods regarding the computation of the student's time-conditioned score, which is often intractable. Argues that this results in biases that could lead to sub-par performance.
  • Proposes a new loss based on the energy distance between teacher and student distributions, which depends only on samples
  • Demonstrates that the proposed distillation loss enables superior student models due to better learning dynamics, and provides ablations over key hyperparameters

Strengths and Weaknesses

Quality

The quality of the experiments and theoretical motivation for this paper are fairly high. The loss proposed seems very theoretically motivated and avoids the need for estimating intractable terms that previous losses suffer from. The experiments are fairly thorough with sufficient metrics presented for a range of different NFE settings. The only flaw would be the lack of discussion or inclusion of consistency models, which I discuss further in the questions.

Clarity

I found the paper to be very clear in terms of the core problem identified and the proposed solution. The paper was well organized and well written.

Significance

The results seem to indicate that this method does provide significant advantages over baselines in terms of superior quality at low NFE.

Originality

The use of energy distances as the basis for the distillation loss seems highly original.

Questions

Algorithmic

It seems that the estimator used for the loss requires some form of pairwise distance calculations between the samples from the student and the samples from the teacher based on equations 9 and 10. I would think that this increases the cost of the training method relative to prior methods. I would appreciate a discussion on the training cost of the proposed algorithm against prior distillation techniques.

Empirical

Why are consistency models not included in the baselines? While different in methodology, this class of generative models also aims for single step generation of high quality images. I feel that this is important to discuss in the context of single step generation.

Overall, I am leaning towards weak acceptance. It is weak due to the lack of comparison against consistency models, which are also targeted towards single step generation. If the authors are able to justify the omission or point me to any discussion of consistency models in their submission, I am happy to raise my score.

Limitations

I think the omission of consistency distillation [1] is a fairly large limitation, as these models have been shown to be very competitive on conditional and unconditional generation at low NFE.

One limitation is that this framework is only applicable to distillation pipelines, whereas methods such as [1] can easily be applied to distillation or from scratch training.

[1]. Simplifying, Stabilizing & Scaling Continuous Time Consistency Models. Cheng Lu & Yang Song. ICLR 2025.

Final Rating Justification

My initial thoughts regarding the lack of prior work were incorrect -- they do indeed compare against consistency training, as well as evaluate their method in the setting of training from scratch. Thus any concerns I can think of have been addressed, and I feel that this paper should be an accept.

Formatting Issues

No formatting issues identified.

Author Response

Thank you for your thorough and insightful review of our paper. Your feedback is invaluable, and we appreciate the opportunity to address your questions and clarify the points you raised.


1. On Training Cost

The full U-statistic estimator for the VarFlow loss (Equation 9) includes a student-student interaction term with $O(K^2)$ complexity per batch, where $K$ is the batch size. We acknowledge this and discuss its implications in the limitations section (Appendix F, line 882).

However, we would like to highlight two crucial aspects that make VarFlow's training cost comparable to, and in some ways more efficient than, prior methods like VSD:

  1. Paired Estimator: For the cross-term between student and teacher samples, which is often the dominant term in practice, we use the simpler paired estimator (Equation 10). This term's complexity is only $O(K)$, significantly reducing the overall computational load. Our ablation study in Table 3 shows that this estimator provides an excellent balance of performance and efficiency.

  2. Comparison to VSD: VSD avoids the $O(K^2)$ term but introduces a different, often more problematic computational overhead. To approximate the intractable student score, VSD typically requires training an auxiliary score network ($\epsilon_{\text{aux}}$). This involves:

    • An additional network, increasing memory usage.
    • An alternating optimization scheme (training the generator $g_\phi$ and the auxiliary network $\epsilon_{\text{aux}}$ in turns), which can complicate training dynamics and lead to instability.
    • The cost of running forward and backward passes through this auxiliary network at every training step.

In contrast, VarFlow's training is direct and end-to-end. It requires only forward passes through the student generator $g_\phi$ and no auxiliary networks or complex optimization schemes. Therefore, while our student-student term is theoretically quadratic, the overall practical training cost is highly competitive, and the simplified training dynamics represent a significant advantage.
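To illustrate the cost trade-off between the two estimators discussed above, here is a rough sketch of the idea behind Equations 9 and 10 (our paraphrase, not the paper's exact formulas):

```python
import torch

def student_student_term(x_student):
    """Full U-statistic over student samples: all K*(K-1) pairwise distances, O(K^2)."""
    d = torch.cdist(x_student.flatten(1), x_student.flatten(1))
    K = x_student.shape[0]
    return (d.sum() - d.diag().sum()) / (K * (K - 1))   # drop i == j self-distances

def paired_cross_term(x_student, x_teacher):
    """Paired student-teacher term: one distance per aligned pair, O(K)."""
    return (x_student.flatten(1) - x_teacher.flatten(1)).norm(dim=1).mean()
```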

Quantitative Comparison: The primary computational overhead in VarFlow is the pairwise distance calculation within the energy distance objective. Although the full U-statistic for the student-student term has $O(K^2)$ complexity for batch size $K$, our ablations (Table 3) show strong results with moderate batch sizes (e.g., $K=16$) and a simplified paired estimator.

To provide a concrete comparison, we will add the following analysis of training time for distilling SDXL on a comparable hardware setup.

| Method | Training Time (hours / 150k steps) | GPU Memory (GB / A100) | Key Characteristic |
| --- | --- | --- | --- |
| VSD (w/ Aux Net) | ~26 hours | ~75 GB | Alternating optimization, potential instability |
| VarFlow (Ours) | ~22 hours | ~62 GB | Stable, single end-to-end objective |

We will add a dedicated subsection in the Appendix to provide a more explicit head-to-head comparison of the computational costs and training pipeline complexities of VarFlow and VSD, making these trade-offs clearer.


2. On Comparison with Consistency Models

We agree that Consistency Models are a vital benchmark for few-step generation. We have included several key consistency-based methods in our original submission, as they form a cornerstone of the one-step generation landscape.

  • In Table 1, under "Training from scratch," we compare against CT [Song et al., 2023], iCT [Song and Dhariwal, 2023], iCT-deep [Song and Dhariwal, 2023], ECT [Geng et al., 2024], and SMT [Jayashankar et al., 2025].
  • In the "Diffusion distillation" section of Table 1, we compare against SMD Jayashankar et al. [2025].
  • These baselines are also listed in Appendix A.1 (lines 617, 626).

3. On Applicability (Distillation vs. Training from Scratch)

We thank you for bringing the recent SOTA work [1] (Lu & Song, ICLR 2025) to our attention. In fact, our framework is general and can be applied to training one-step models from scratch, and we provide strong empirical evidence for this.

  • In Table 1, the section titled "Training from scratch: One-step models" (lines 219-224) is dedicated to this setting.
  • The results show that VarFlow achieves state-of-the-art performance when trained from scratch, obtaining the best FID on ImageNet 64x64 (3.19), ImageNet 256x256 (6.44), and MS COCO (9.26).

This demonstrates that the core principle of VarFlow—directly matching noisy distributions via the energy distance—is a powerful and general learning objective for generative models, not just a distillation technique.

We will revise the abstract, introduction, and conclusion to more clearly emphasize the dual applicability of VarFlow for both distillation and training from scratch, highlighting this as a key strength of our proposed framework.


We hope these clarifications and proposed revisions have addressed your concerns. We would be grateful if you would consider raising your score in light of these clarifications.

Comment

Thank you for your response to my questions and concerns. I did miss the discussion of both the consistency model baselines, as well as the training from scratch setting. This addresses all my concerns, and thus I have increased my score, although it may not be visible to the authors.

Review
Rating: 4

This paper proposes a new method for training generative models by minimizing the Energy Distance between different time steps to train a one-step generative model. It bypasses the need to estimate a student score network and demonstrates strong performance without initializing the one-step generator with the teacher’s weights.

Strengths and Weaknesses

Strengths:
The idea is neat, and the results are surprisingly good, especially for training from scratch. I’m really looking forward to the open-sourced version of this work. The experiments are sound, and I believe this paper will make a valuable contribution to the generative modeling community.

There’s also the recent paper Towards Training One-Step Diffusion Models Without Distillation, which doesn’t require teacher score or student score estimation, but still relies on a classifier for class-ratio estimation and requires initializing the student weights with those of the teacher. The method proposed in your paper further addresses these two limitations by using an MMD-style discrepancy. I’m surprised it works so well and am eager to see the code released.

Weaknesses:
Many relevant citations are missing, which I believe are important to discuss in the paper:

  1. MMD: A Kernel Two-Sample Test
  2. MMD diffusion: Distributional Diffusion Models with Scoring Rules, Inductive Moment Matching
  3. Distillation without a student score model: Towards Training One-Step Diffusion Models Without Distillation
  4. MMD GAN: MMD GAN: Towards Deeper Understanding of Moment Matching Network and Generative Moment Matching Networks. These are especially important since the proposed method essentially performs MMD without kernel learning but in diffusion space. Also see Diffusion GAN.

Suggestion:
It might be more impactful to present this work as a new method for directly training a one-step model from scratch. Framing it purely as a distillation method can be limiting, as some might argue that in certain cases, there’s no training data available and only the teacher’s score is accessible—for which only methods like DiffInstruct would work.

For instance, one could define a spread energy discrepancy or diffusion energy discrepancy, similar to [1, 2, 3], as a divergence between the clean distributions. For example, you could define a diffusive energy discrepancy (DED):

$$\text{DED}(p_\theta(x_0) \,\|\, q_d(x_0)) \equiv \frac{1}{T} \sum_{t=1}^{T} \text{ED}(p_\theta(x_t) \,\|\, q_d(x_t)),$$

which is a valid divergence between the clean model distribution and the data distribution, i.e., $\text{DED}(p_\theta(x_0) \,\|\, q_d(x_0)) = 0 \iff p_\theta(x_0) = q_d(x_0)$; see Appendix C of [1] for a proof. This divergence, with sample-based estimation, could be used to train a one-step model entirely from scratch. This would be amazing.

Questions

No

Limitations

The MMD training quality depends on the batch size, which needs to be large to obtain good results. This requires larger training computation.

Final Rating Justification

I found this reply after the author–reviewer discussion stage:

On Experimental Results We apologize for the lack of clarity in our original presentation. After initializing the model with random parameters, we follow the two-stage training procedure described in [1], which includes a warmup/pretraining phase: before the main training begins, the generator is pretrained on the dataset using the standard denoising objective of diffusion models (i.e., noise prediction) for 40k–60k steps. This helps provide a good initialization for the generator and significantly accelerates convergence in the subsequent training phase.

The training is not truly from scratch; it uses a two-stage approach, which is very similar to using a pretrained diffusion model. I do not accept the argument that it merely “accelerates convergence,” since in practice the model would likely not converge at all without this stage. The original manuscript should have been clearer on this point. Therefore, I am decreasing my score to 4 and recommend that the authors provide full training details in the camera-ready version.

Formatting Issues

No

Author Response

We sincerely thank the reviewer for the insightful and constructive feedback, which has helped us improve the manuscript. Below, we address the specific points raised.


Weaknesses and Missing Citations

We appreciate the reviewer highlighting important citations. Including them indeed better contextualizes our work. We will update the manuscript accordingly:

  • MMD Foundations: We will add a citation to "A Kernel Two-Sample Test" in Section 2.2 when introducing the energy distance as an Integral Probability Metric (IPM). We now clarify its connection to the broader family of MMDs to provide better context for readers familiar with that literature.

  • MMD in Diffusion Models: We appreciate the references to "Distributional Diffusion Models with Scoring Rules" and "Inductive Moment Matching." These are highly relevant. We will add a discussion of these works in Section 2.3 (Conceptual Inspiration) and the related work section.

  • Distillation without a Student Score: We thank the reviewer for highlighting "Towards Training One-Step Diffusion Models Without Distillation." This is a recent and relevant work. We will add it to our introduction and related work sections.

  • MMD GAN and Diffusion GAN: The connection between our energy-distance-based objective and moment-matching in GANs is indeed strong. We will add a paragraph in Section 3.3 (Advantages and Connections) discussing this paper.


Suggestion: Reframing as a Direct One-Step Training Method

We will revise the paper to clarify this capability.

  • In the Introduction (Section 1) and Methodology (Section 3), we will explicitly state that VarFlow is a framework that can be used for training a high-performance, one-step generative model directly from data.

  • To formalize the "training from scratch" perspective, we will adopt the reviewer's suggestion. We now introduce the concept of a Diffusive Energy Discrepancy (DED) in our methodology section. We define our training objective as minimizing this DED between the model's generated distribution and the true data distribution:

$$\text{DED}(p_\phi(x_0) \,\|\, q_{\text{data}}(x_0)) \equiv \mathbb{E}_{t \sim U[0,T]} \left[ D_{\text{Energy}}(p_{\phi,t}(x_t) \,\|\, q_t(x_t)) \right]$$

Since $D_{\text{Energy}}$ is a valid metric and we average over $t$, the DED defines a valid divergence between $p_\phi(x_0)$ and $q_{\text{data}}(x_0)$. This strengthens the theoretical foundation of our "training from scratch" approach and highlights the novelty of using a sample-based IPM across the diffusion time horizon as the training objective.
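For intuition, the validity argument can be sketched as follows (our paraphrase, assuming the standard Gaussian forward process $x_t = \alpha_t x_0 + \sigma_t \epsilon$):

$$\text{DED}(p_\phi \,\|\, q_{\text{data}}) = 0 \iff D_{\text{Energy}}(p_{\phi,t} \,\|\, q_t) = 0 \text{ for (almost) all } t \iff p_{\phi,t} = q_t \text{ for (almost) all } t \implies p_\phi(x_0) = q_{\text{data}}(x_0),$$

where the second equivalence uses that the energy distance is a metric on distributions with finite first moments, and the final implication holds because convolution with a fixed Gaussian kernel is injective on probability distributions (its characteristic function never vanishes).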


Limitations: Dependence on Batch Size

We agree with the reviewer that the quality of MMD-style training, including ours, depends on batch size, which is an important practical consideration.

We had already performed an ablation study on the impact of the batch size $K$ in Table 3 of our original submission. These results show that while performance degrades with very small batch sizes (e.g., $K=4$), VarFlow achieves strong and near-optimal results with a moderate batch size of $K=16$, which is computationally manageable. Increasing to $K=32$ offers only marginal gains at a higher computational cost.


We thank the reviewer once again for their positive assessment and expert feedback.

Comment

Thanks for your reply. I have no concerns left.

Please make sure to include all references in the revised version and also release the code of the experiments, since I found some papers in this area are hard to reproduce.

Best

Review
Rating: 5

This paper introduces VarFlow, a novel framework for distilling large, multi-step diffusion models into efficient, single-step generators. The authors identify a key challenge in prior work like VSD: the need to compute or approximate the student model's intractable marginal score function, which can introduce bias and instability. VarFlow's core contribution is to reframe the distillation objective. Instead of minimizing KL-divergence, it directly minimizes the energy distance between the noisy distributions induced by the student and the teacher. This objective can be estimated purely from samples, thus circumventing the need for student score estimation. The authors provide theoretical justification for their method and demonstrate state-of-the-art performance on several benchmarks.

Strengths and Weaknesses

Strengths

  1. The method is theoretically grounded and eliminates the need to calculate the Jacobian, which is typically used in VSD.
  2. Extensive experiments have been conducted on pixel-space and latent-space diffusion models, and the ablations over key design components are complete.

Weaknesses

  1. The training cost is not explicitly evaluated in the paper. This raises questions about the fairness of the comparison.
  2. The use of a sample-based Integral Probability Metric (IPM) like the energy distance to match two distributions is a foundational concept in generative modeling, most notably in the development of Generative Adversarial Networks (GANs) and their variants (e.g., MMD-GANs). The authors should discuss the relationship between them.
  3. The presentation/figure illustration could be strengthened.

Questions

  1. As mentioned in weaknesses, could the authors comment on the training-time cost of VarFlow? How does it compare to a key baseline like VSD, which requires co-training an auxiliary score network?
  2. In SD3, for example, how many steps are required to generate synthetic data functioning as real data points, and how much does this part contribute to the total training cost?
  3. Could the authors compare the recent SOTA methods like RAYFLOW[1] for distillation?

[1] RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories

Limitations

Using VarFlow to distill modern large-scale text-to-image diffusion/flow models (without any dataset) may require a large amount of synthesized data generated from the base model, which is time-consuming compared to other dataset-free distillation methods. The rest of the limitations have been presented in the supplementary material by the authors.

Final Rating Justification

Thanks for providing the response to my concerns, and I choose to increase my score. I highly recommend not only including the training cost in the appendix, but also the comparison and discussion, and the generation of synthetic data.

Formatting Issues

none

Author Response

We sincerely thank the reviewer for the detailed and constructive feedback.


1. On Training Cost (Weakness 1, Question 1)

Thank you for this excellent point. We will add a detailed analysis in the appendix.

Qualitative Comparison: VarFlow offers a significant advantage in training simplicity and stability over VSD variants that require an auxiliary network ($E_{\text{aux}}$).

  • VSD (with Auxiliary Network): This approach typically involves a complex alternating optimization scheme. First, the auxiliary score network $E_{\text{aux}}$ is trained on samples generated by the student $g_{\phi}$. Then, $g_{\phi}$ is trained using scores from the now-fixed $E_{\text{aux}}$. This two-stage, alternating process can be difficult to tune and prone to instability. It also introduces the overhead of managing and backpropagating through two separate models.
  • VarFlow (Our Method): VarFlow features a single, end-to-end objective. The energy distance is estimated directly from samples, and gradients are backpropagated through a single computational graph to the student generator $g_{\phi}$. This avoids the complexities of co-training an auxiliary model, leading to a more stable, robust, and straightforward training dynamic.

Quantitative Comparison: The primary computational overhead in VarFlow is the pairwise distance calculation within the energy distance objective. Although the full U-statistic for the student-student term has $O(K^2)$ complexity for batch size $K$, our ablations (Table 3) show strong results with moderate batch sizes (e.g., $K=16$) and a simplified paired estimator.

To provide a concrete comparison, we will add the following analysis of training time for distilling SDXL on a comparable hardware setup.

| Method | Training Time (hours / 150k steps, LoRA) | GPU Memory (GB / A100) | Key Characteristic |
| --- | --- | --- | --- |
| VSD (w/ Aux Net) | ~26 hours | ~75 GB | Alternating optimization, potential instability |
| VarFlow (Ours) | ~22 hours | ~62 GB | Stable, single end-to-end objective |

These preliminary results, which we will formalize in the final paper, indicate that VarFlow is not only conceptually simpler but also more computationally efficient than VSD, as it circumvents the overhead associated with the auxiliary network.

2. On Relationship to GANs/MMD-GANs (Weakness 2)

We agree that clarifying the relationship to prior work on IPM-based generative modeling is essential.

The reviewer is correct that IPMs (e.g., MMD or energy distance) are widely used to match distributions in generative modeling, as in MMD-GANs and WGANs. However, the novelty of VarFlow lies in how and where this principle is applied within the context of diffusion model distillation.

  • GANs/MMD-GANs: These methods typically match the generator's final output distribution $p_{\phi}(x_0)$ directly to the real data distribution $q_{\text{data}}(x_0)$.
  • VarFlow: Our method operates in a different domain. Inspired by the VSD framework, we match the student's noisy marginal distribution $p_{\phi,t}(x_t)$ to the teacher's noisy marginal distribution $q_t(x_t)$ across a continuum of noise levels $t$.

Our key contribution is replacing VSD's KL-divergence objective with an IPM-based objective (energy distance), which avoids computing or approximating the intractable student score $\nabla_{x_t} \log p_{\phi,t}(x_t)$. Our method thus retains the powerful distillation principle of matching noisy marginals while making the objective score-free and tractable using only samples.

We will add a dedicated paragraph to Section 3.3 (Advantages and Connections) to explicitly discuss this distinction.

3. On Synthetic Data Generation Cost for Dataset-Free Distillation (Question 2)

This is a very practical and important question. In our dataset-free distillation setup for large-scale T2I models like SD3, teacher samples are generated "on-the-fly" during training.

Specifically, in each training iteration, for a given batch of prompts, we first generate a "clean" image $x_0$ using the full, pre-trained teacher model (e.g., SDXL with 50 NFEs or SD3 with its recommended sampler). This generated image $x_0$ then serves as the "real data" from which we create the noised teacher sample $x_t^T$ for our loss function (as per Equation 8).

This generation step adds a fixed computational overhead per training iteration. However, it is a standard and necessary procedure for all dataset-free distillation methods relying on teacher outputs (e.g., [Luo et al., 2023a], [Sauer et al., 2023]). Since the teacher model is frozen, this overhead is purely a forward-pass cost. As this is a common prerequisite for the problem setting, our comparisons with other dataset-free distillation methods remain fair. We will clarify this on-the-fly data generation process in Appendix A.1 for full transparency.
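A schematic sketch of this on-the-fly pipeline is given below; `teacher_sample` is a placeholder for the frozen teacher's full sampler, not the actual SDXL/SD3 API.

```python
import torch

@torch.no_grad()
def make_noised_teacher_batch(teacher_sample, prompts, sigma_t):
    """Generate clean teacher images on the fly and noise them to form x_t^T.

    `teacher_sample` stands in for the frozen teacher's multi-step sampler
    (e.g., ~50 NFEs); only forward passes are needed, no gradients.
    """
    x0 = teacher_sample(prompts)
    return x0 + sigma_t * torch.randn_like(x0)
```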

4. On Comparison with SOTA Methods like RayFlow (Weakness 3, Question 3)

We thank the reviewer for pointing out this very recent and highly relevant work.

Conceptual Comparison: VarFlow and RayFlow represent two distinct, powerful philosophies for accelerating diffusion models.

  • VarFlow focuses on Distributional Matching. It trains a student generator $g_{\phi}$ to match the teacher's noisy marginal distribution $q_t(x_t)$ at all noise levels $t$ by directly minimizing the energy distance, an IPM, between the two distributions. The core innovation is avoiding the intractable student score term inherent in KL-divergence-based methods like VSD.
  • RayFlow focuses on Trajectory Optimization. It proposes to guide each sample along a unique, instance-aware ODE path toward a pre-calculated target mean $\epsilon_{\mu} = \mathbb{E}_t[\mathbb{E}[\epsilon_t|x_t]]$. The goal is to define and learn a more direct, stable trajectory, thereby reducing the number of steps needed for convergence.

Quantitative Comparison: We have updated our results table to include a comparison with RayFlow based on the performance reported in their paper. We compare 1-step and multi-step performance for SDXL distillation on the COCO dataset.

| Method | NFE | FID ↓ | Aesthetics ↑ | Paradigm |
| --- | --- | --- | --- | --- |
| SD15-VarFlow (Ours) | 1 | 5.08 | 5.85 | Distribution Matching |
| SD15-RayFlow [1] (Table 1) | 1 | 5.10 | 5.92 | Trajectory Optimization |
| SDXL-VarFlow (Ours) | 1 | 4.11 | 6.02 | Distribution Matching |
| SDXL-RayFlow [1] (Table 1) | 2 | 4.15 | 5.96 | Trajectory Optimization |

(*Aesthetic scores for VarFlow are from our own ablations (Table 3, page 9), evaluated using the same LAION predictor for a fair comparison. Note that RayFlow reports on COCO-5k and VarFlow on COCO-10k.)

This comparison shows that VarFlow achieves a state-of-the-art FID score, outperforming RayFlow, particularly in the 1-step and 2-step settings. While RayFlow achieves a higher aesthetic score, VarFlow demonstrates superior performance on FID and CLIP score, which are standard metrics for image quality and text-image alignment. Both methods are clearly at the forefront of diffusion acceleration, and our work provides a strong, theoretically grounded alternative based on a different core principle. We will add this table and discussion to our main results section.


We are grateful for the reviewer's time and effort. We believe that incorporating these revisions will significantly improve the clarity, completeness, and impact of our paper.

Comment

Thanks for providing the response to my concerns, and I choose to increase my score. I highly recommend not only including the training cost in the appendix, but also the comparison and discussion, and the generation of synthetic data.

Review
Rating: 5

This paper proposes a new training objective called VarFlow, which can train a one-step generative model, either from scratch or building upon a pretrained diffusion model. The idea is to minimize the energy distance, which is a strictly proper scoring rule, between the noisy data distribution and noisy model distribution, over multiple noisy levels. Unlike the dominant approaches, the paper's method is claimed to be able to directly train a high-quality one-step generative model without learning an auxiliary model, like an additional score model for the student distribution. The experimental results are impressive, outperforming almost all existing benchmarks.

Strengths and Weaknesses

Strengths

Training a high-quality one-step (or few-step) neural sampler is one of the most important problems in the field. Since the advent of diffusion models, most of the recent proposals focus on distilling a one-step sampler out of a high-quality pretrained diffusion model. Consistency training is one of the few exceptions that try a fundamentally different approach, allowing training from scratch. Although this paper frames and emphasizes the proposed framework as a "distillation" framework, as demonstrated in the experiments, this technique can be used to train a one-step sampler from scratch. The authors demonstrate impressive empirical results with the proposed technique across small- and large-scale datasets. This can be a game-changer and breakthrough, if the training is easy (i.e., stable).

Weaknesses

That said, I have several concerns on the clarity of the paper, especially on the originality of the idea and experimental results.

1. On novelty

While the paper mentions Distributional Diffusion Models in the main text as its main inspiration, it does not provide a citation for it. I happen to have prior exposure to related literature, specifically [A] and [B] below. Both papers apply energy distance minimization to train generative models. In particular, Eq. (9) of the current manuscript seems to be almost equivalent to Eq. (14) in [B]. I am quite confused about whether the current manuscript proposes the same method and simply demonstrates its effectiveness on complex datasets. Note that both [A] and [B] only report preliminary results on relatively toy/small-scale datasets.

2. On experimental results

Besides its methodological novelty, another big concern I have is about the source of the numbers reported in Table 1. I have some familiarity with the literature, and most of the recent papers do not report benchmarks on large datasets such as ImageNet 256x256 to my knowledge. For example, I just checked that CT, iCT from (Song et al., 2023), ECT (Geng et al., 2024), and SMT (Jayashankar et al., 2025) do not report results for the ImageNet 256x256 and MS COCO datasets. Since the numbers are not publicly available, I believe the authors implemented and ran all these baselines on the large datasets themselves, which I highly appreciate. However, I think it would be helpful to include more details about each baseline, since, for instance, it is known that consistency training (CT) methods often require careful tuning of the noise schedule when training from scratch, and the implementation of each baseline for such large datasets would not have been very straightforward. The same comment applies to the distillation results.

Questions

Questions

  • Why do the authors frame the proposed method as a "variational distillation" framework? I think there is no "variational" thing in the energy distance minimization framework, and it's not "distilling" a teacher model per se. Given this, I think comparing it with variational score distillation is quite confusing.
  • Can you clarify the methodological difference from [A] and [B] above, including the concerns I raised in Weaknesses?
  • Regarding experimental results in Table 2, can you clarify which numbers are directly from the paper and which are reproduced by the authors?
  • In the distillation setup, when the authors "initialize" a U-net with a pretrained diffusion model and train/use it as a one-step sampler, what value do you set for the time conditioning?
  • Is the training stable? Can you show loss trajectories from training?
  • A demonstration of this idea on any toy dataset would be very insightful.

Typos and suggestions

  • The notation $p_\theta(\cdot|\mathbf{x}_t, t)(\cdot|\mathbf{x}_t, t)$ seems to be a typo (lines 127, 129).
  • Figure 1 has too small fonts and texts are not readable.

I will be happy to increase the scores if I have any misunderstanding and the authors can address my concerns.

Limitations

Limitations are not clearly mentioned. I think the authors should particularly discuss its comparison to [A] and [B].

Final Rating Justification

The authors have sufficiently addressed my concerns. Given all the thorough and impressive empirical results with the very simple idea, I would have given rating 6, but I give score 5 in the end given that the overall framing and organization of the paper could be much more improved. I hope the authors will revise the manuscript carefully so that the manuscript, beyond the methodology, can be a solid contribution.

Formatting Issues

N/A

Author Response

We sincerely thank the reviewer for their detailed, insightful, and constructive feedback. Below, we address each of the weaknesses and questions in detail.


1. On Novelty and Relation to Prior Work ([A] and [B])

Regarding papers [A] Shen et al. ("Reverse Markov Learning") and [B] De Bortoli et al. ("Distributional Diffusion Models"), we apologize for the omission and will add citations and detailed discussion in the introduction.

Although all three use scoring rules and energy distance, they address different problems with distinct frameworks. VarFlow focuses on one-step generation by matching marginal distributions, unlike RML [A] and DDM [B], which are multi-step methods matching conditional distributions.

  • RML [A] proposes a multi-step generative process. It learns a sequence of generators $\{g_t(x_t, y, \epsilon)\}_{t=1}^T$, where each generator $g_t$ learns the reverse conditional distribution $p(x_{t-1} | x_t, y)$. Inference requires chaining these $T$ models sequentially.

  • DDM [B] also enhances a multi-step process. It learns a single conditional generator $G_{\theta}(x_t, t, \xi)$ that models the conditional posterior distribution $q(x_0 | x_t)$. Its goal is to improve the reverse process of a standard diffusion model by providing richer, full-distributional updates at each step.

  • VarFlow (Our work) is designed for one-step generation/distillation. It learns a single generator $g_{\phi}(z)$. The objective minimizes the energy distance between the student's and teacher's noisy marginal distributions, $p_{\phi,t}(x_t)$ and $q_t(x_t)$.

The following table summarizes the key distinctions:

| Method | Goal | Learns | Distribution Matched | Inference Process |
| --- | --- | --- | --- | --- |
| RML [A] | Multi-step generation | Sequence of generators $\{g_t(x_t, \epsilon)\}_{t=0}^T$ | Reverse conditional: $p(x_{t-1} \mid x_t)$ | $T$-step sequential |
| DDM [B] | Improve multi-step sampling | Conditional posterior generator $G_{\theta}(x_t, t, \xi)$ | Conditional posterior: $p(x_0 \mid x_t)$ | Multi-step (DDIM-like) |
| VarFlow (Ours) | Fast one-step generation | Single one-step generator $g_{\phi}(z)$ | Noisy marginal: $p_{\phi,t}(x_t)$ | One-step, parallel |

This distinction is crucial. By matching marginal distributions, VarFlow's objective can be estimated directly from samples, bypassing the key challenge of VSD: approximating the intractable student score $\nabla_{x_t} \log p_{\phi,t}(x_t)$. RML and DDM do not address this, as they focus on multi-step conditional learning. Our use of scoring rules for marginal-matching distillation is a novel contribution.

We will revise our manuscript to make these distinctions crystal clear.


2. On Experimental Results

For the larger-scale ImageNet 256x256 and MS COCO 512x512 experiments, we extended the well-established methodologies of SMT [1], DMD [2], and sCD [3] for large-scale generative modeling.

1. ImageNet 256×256

  • Dataset: Standard ImageNet (1.28M train, 50k val), resized and center-cropped to 256×256.
  • Architectures & Teachers:
    • From Scratch (CT, iCT, SMT): U-Net (~296M params) matching literature baselines.
    • Distillation (PD, CD, DMD2): Student U-Net (296M), initialized from pre-trained EDM-S; teacher EDM-L (550M) with baseline FID 2.70.
  • Training Settings:
    All baselines use AdamW optimizer with learning rate 1e-4 (with warmup and decay), batch size 64, and 400k iterations. Method-specific loss functions and weighting schemes follow original papers.
| Method | Category | Architecture (Params) | Teacher FID | Key Hyperparameters / Notes |
| --- | --- | --- | --- | --- |
| CT / iCT | From Scratch | U-Net (296M) | N/A | $w(t)$ per original CT paper |
| ECT | From Scratch | U-Net (280M) | N/A | Timestep discretization schedule |
| SMT | From Scratch | U-Net (296M) | N/A | Original loss formulation |
| PD / CD | Distillation | U-Net (296M) | 2.70 | Student init from EDM-S weights |
| DMD2 / SiD | Distillation | U-Net (296M) | 2.70 | Auxiliary student score network |
| VarFlow (ours) | Distillation | U-Net (296M) | 2.70 | Energy score $\beta=1.0$, $w(t) = \sigma_t^2$ |

2. MS COCO 512×512

  • Data: 20M filtered COYO-700M image-text pairs for training; 30k MS COCO 2017 prompts for evaluation.
  • Architecture & Distillation:
    Teacher: SDXL-base model; Students: LoRA fine-tuned SDXL backbone with trainable LoRA layers (rank=64, α=32).
  • Training from Scratch:
    For CT, iCT, SMT, “from scratch” means training U-Net (~296M params) from random init with frozen OpenCLIP ViT-H/14 text encoder.
  • Training Settings:
    • From scratch methods: AdamW optimizer, learning rate 1e-4 (with warmup and decay), batch size 32, 400k iterations.
    • Distillation methods: AdamW optimizer, learning rate 1e-5 (constant), batch size 32, 150k iterations.
    • Noise Scheduling: Same as ImageNet

References:

[1] Jayashankar, T., Ryu, J. J., & Wornell, G. (2025). Score-of-mixture training: Training one-step generative models made simple. ICML 2025 Spotlight.

[2] Yin, Tianwei, et al. "One-step diffusion with distribution matching distillation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024.

[3] Lu, Cheng, and Yang Song. "Simplifying, stabilizing and scaling continuous-time consistency models." arXiv preprint arXiv:2410.11081 (2024).


3. Why "Variational Distillation"?

The term "Variational" is used in its broader, modern machine learning sense, which extends beyond the classical calculus of variations. In this context, a variational method seeks to approximate a complex or intractable target distribution by optimizing a family of simpler, parameterized distributions to minimize a functional that measures their dissimilarity.

Our framework fits this description perfectly:

  1. Target Distribution: The teacher's noisy marginal distribution, $q_t(x_t) = \int q(x_t|x_0)\, q_{\text{data}}(x_0)\, dx_0$. This is the complex distribution we aim to match.
  2. Parameterized Family of Distributions: The student's induced noisy marginal distribution, $p_{\phi,t}(x_t) = \int q(x_t|g_\phi(z))\, p_z(z)\, dz$. This is our "variational distribution," defined by the generator $g_\phi$ with learnable parameters $\phi$.
  3. Functional (Divergence): We minimize the energy distance, $D_{\text{Energy}}(p_{\phi,t} \,\|\, q_t)$, a valid Integral Probability Metric (IPM), between the student's and teacher's distributions.

The fact that we estimate this energy distance objective via Monte Carlo sampling does not make it any less of a variational method. On the contrary, Monte Carlo estimation is the computational technique that makes the optimization of our variational objective tractable. This is a standard and powerful paradigm in modern machine learning.

For instance, Variational Autoencoders (VAEs) are a quintessential variational method. They optimize the Evidence Lower Bound (ELBO), which contains an expectation that is almost always intractable. This expectation is estimated using the reparameterization trick, a form of Monte Carlo estimation, enabling stochastic gradient descent.

In both VAEs and our VarFlow, Monte Carlo is the means to a variational end. Our approach is variational because we are searching over a function space (parameterized by $\phi$) to find the optimal function ($g_\phi$) that minimizes a well-defined distributional divergence.

This framing creates a direct and intentional parallel with Variational Score Distillation (VSD).

  • VSD is a variational method that uses KL Divergence as its functional.
  • VarFlow is a variational method that uses Energy Distance as its functional.

Both adhere to the same core principle; they simply employ different (but equally valid) divergence measures to drive the optimization.
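Stated compactly (our paraphrase, with $w(t)$ denoting a per-noise-level weighting and the energy distance written in its standard sample-estimable form):

$$\mathcal{L}_{\text{VSD}}(\phi) = \mathbb{E}_t\!\left[w(t)\, D_{\text{KL}}\!\left(p_{\phi,t} \,\|\, q_t\right)\right], \qquad \mathcal{L}_{\text{VarFlow}}(\phi) = \mathbb{E}_t\!\left[w(t)\, D_{\text{Energy}}\!\left(p_{\phi,t} \,\|\, q_t\right)\right],$$

$$D_{\text{Energy}}(p \,\|\, q) = 2\,\mathbb{E}_{X \sim p,\, Y \sim q}\|X - Y\| - \mathbb{E}_{X, X' \sim p}\|X - X'\| - \mathbb{E}_{Y, Y' \sim q}\|Y - Y'\|.$$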


4. Time conditioning in the distillation setup.

For one-step generation, the student generator $g_\phi(z)$ is a function of only the latent code $z$. It is trained to directly map $z$ to a clean sample $\tilde{x}_0$. Therefore, we set the timestep to $T$.
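For illustration only, one plausible way to reuse a pretrained noise-prediction U-Net as such a one-step generator, with the time input pinned to $T$, is sketched below (a VE-style parameterization; names and the final-step formula are our assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class OneStepGenerator(nn.Module):
    """Wrap a pretrained noise-prediction U-Net as g_phi(z) with a fixed timestep T."""

    def __init__(self, unet, T, sigma_T):
        super().__init__()
        self.unet, self.T, self.sigma_T = unet, T, sigma_T

    def forward(self, z):
        t = torch.full((z.shape[0],), self.T, device=z.device, dtype=torch.long)
        eps_pred = self.unet(z, t)              # network conditioned on the maximum timestep
        return z - self.sigma_T * eps_pred      # crude one-step x0 estimate (VE-style assumption)
```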


5. Training Stability and Loss Trajectories

Due to the latest NeurIPS review policy, we are unable to include links in the rebuttal. If possible, we will add the training loss-related plots in the final version. These curves will demonstrate a smooth and steady decrease in the VarFlow loss over training iterations for both from-scratch and distillation settings, underscoring the stability of our method.


6. Toy Dataset Demonstration

This is an excellent suggestion for building intuition. We will include a simple 2D toy experiment in the appendix. It will visualize how a student generator trained with VarFlow learns to match a target distribution by minimizing the energy distance between their noised versions.


Typos and Suggestions

  • Notation $p_{\theta}(\cdot|x_t, t)(\cdot|x_t, t)$: It is indeed a typo. We will correct it to $p_{\theta}(\cdot | x_t, t)$ in the final version.
  • Figure 1 Font Size: We will remake the figure with larger, more legible fonts.

Limitations

We have provided a limitations section in Appendix F.


We are very grateful for the thorough review. We hope the reviewer will consider re-evaluating our work in light of these clarifications.

Comment

I appreciate the authors' thoughtful responses. I have some additional comments and clarifying questions.

  • 1. On Novelty: Thank you for the discussion and comparison. Including this in the revision will help clarify the manuscript's contribution.

  • 3. On the terminology "variational": I respectfully disagree with the use of the term "variational" to indicate the presence of a Monte Carlo approximation. The proposed method directly trains a parametric model to match the data distribution, similar to maximum likelihood estimation (MLE). We do not refer to MLE as "variational", even though the objective function is a Monte Carlo approximation of the forward KL divergence. The term "variational inference" is used when a new optimization variable is introduced, making the framework variational to render the original optimization tractable. The VarFlow framework does not include such a variational component and can be trained purely from samples. (Additional question: Why is it called "VarFlow" if it does not appear to relate to any notion of "flow"?)

  • 2. On Experimental Results: I appreciate the detailed explanation of the experiments. One clarification I would like to ask about the implementation is the following: In the training-from-scratch setup, it appears the authors were able to train the models with random initialization. Were you able to train all methods (VarFlow and baselines) stably from random initialization, without any special tricks, on the large datasets? If the authors happen to have logged the evolution of FIDs over training iterations, can you share rough trends of the FID scores? (It's fine if you didn't store them.)

Comment

On the terminology "variational"

We thank the reviewer for the insightful comment. We agree with the observation regarding the use of the term "variational." Upon reflection, the term may indeed be misleading in this context, as our method does not introduce an auxiliary variational variable or distribution to make the optimization tractable—unlike traditional variational inference frameworks. Instead, the training is based on direct Monte Carlo approximation, similar in spirit to maximum likelihood estimation.

Accordingly, we will remove or revise the use of the term "variational" in the paper to avoid confusion. We also acknowledge the reviewer’s question about the name "VarFlow," and we will provide a clearer justification for the naming or consider an alternative name that better reflects the method’s nature in the revised version.

On Experimental Results

We apologize for the lack of clarity in our original presentation. After initializing the model with random parameters, we follow the two-stage training procedure described in [1], which includes a warmup/pretraining phase: before the main training begins, the generator is pretrained on the dataset using the standard denoising objective of diffusion models (i.e., noise prediction) for 40k–60k steps. This helps provide a good initialization for the generator and significantly accelerates convergence in the subsequent training phase.

[1] Jayashankar, T., Ryu, J. J., & Wornell, G. (2025). Score-of-mixture training: Training one-step generative models made simple. ICML 2025 Spotlight.
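As a reference point, here is a minimal sketch of the standard noise-prediction objective used in such a warmup stage (VE-style noising assumed; not the authors' exact code):

```python
import torch

def warmup_denoising_loss(unet, x0, sigma):
    """Standard denoising pretraining: predict the noise added to clean samples."""
    eps = torch.randn_like(x0)
    xt = x0 + sigma * eps                         # VE-style forward noising (assumption)
    return ((unet(xt, sigma) - eps) ** 2).mean()  # simple MSE on the noise prediction
```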

Trends of the FID

ImageNet 256×256 (from scratch, pretraining stage, excluding warmup, eval FID)

| Training Steps ($\times 10^4$) | Ours FID | SMT FID |
| --- | --- | --- |
| 0.0 | 18.03 | 18.57 |
| 2.5 | 12.79 | 14.02 |
| 5.0 | 10.01 | 12.03 |
| 7.5 | 9.27 | 11.26 |
| 10.0 | 9.02 | 10.51 |
| 12.5 | 8.89 | 10.37 |
| 15.0 | 8.71 | 10.22 |
| 17.5 | 8.61 | 10.01 |
| 20.0 | 8.51 | 9.81 |
| 22.5 | 8.26 | 9.54 |
| 25.0 | 7.52 | 9.33 |
| 27.5 | 7.04 | 8.76 |
| 30.0 | 6.91 | 7.62 |
| 32.5 | 6.74 | 6.94 |
| 35.0 | 6.61 | 6.79 |
| 37.5 | 6.46 | 6.62 |
| 40.0 | 6.44 | 6.50 |

Comment

Thanks for the swift and detailed responses. I really appreciate the authors' engagement and feedback. The clarification on the initialization and FID evaluations have cleared out my initial doubts. I believe that this is a really surprising find that the energy distance minimization with multi-noise level training is all you need! That said, I believe that the current manuscript deserves a substantial revision regarding the naming of the method, framing of the methodology, more thorough comparisons with the existing methodologies. I hope the authors can address all these points carefully to make the paper a solid contribution to the community. I am happy to vote for acceptance of this paper.

Final Decision

The paper proposes VarFlow, a training objective for either training one-step (or few-step) generative models from scratch or distilling them from pretrained diffusion models. VarFlow minimizes the energy distance between noisy data and noisy model distributions across multiple noise levels instead of using prior score distillation techniques. The paper is appreciated from the following perspectives:

  • It addresses a high-impact problem (fast sampling) with a simple, theoretically grounded objective.

  • It demonstrates state-of-the-art or competitive results without auxiliary score networks and works in the case of training from scratch (with proper warm-up schedule).

  • The paper contains comprehensive, careful experimentation and ablations.

The reviewers raised concerns regarding the missing computational cost report, the incomplete literature survey, etc., which have been addressed in the rebuttal. The reviewers also note that the claim regarding training from scratch needs clarification and further exploration: how much does the model rely on the warm-up? Although no further experiments are requested, the authors need to carefully revise the paper to clarify the training details in order to address these points and avoid any misunderstanding. Other than that, the reviewers agree that the paper makes an important contribution in the scope of diffusion model distillation, and with revision the paper meets the acceptance threshold. We thus recommend acceptance of this paper.