Inverse Bridge Matching Distillation
Abstract
Reviews and Discussion
This work adapts the technique of score distillation from diffusion models to diffusion bridges for accelerated generation. Empirical results demonstrate that the proposed approach is superior to existing baselines across multiple image-to-image tasks.
Questions for Authors
N/A
Claims and Evidence
Most of the claims are well supported, though some comments on related work need more careful examination (see weaknesses).
Methods and Evaluation Criteria
The proposed method and evaluation make sense for accelerating sampling for diffusion bridges.
Theoretical Claims
The theoretical claims are checked in detail and are correct. However, the equivalent derivation has already been proposed in previous work, and its close connection with the proposed method in this work should at least be discussed (see weaknesses).
Experimental Design and Analysis
The experiments are well-executed.
Supplementary Material
I reviewed all the supplementary materials.
Relation to Broader Literature
This research is related to the broader literature on solving inverse problems.
Missing Essential References
All key papers are cited, but in my opinion they are not discussed as thoroughly as they should be (see weaknesses).
Other Strengths and Weaknesses
Strengths
- This work demonstrates clear empirical significance in accelerating the sampling of diffusion bridges and designs key techniques (e.g., a noise-conditioned one-step generator) tailored to distilling diffusion bridges.
Weaknesses
- In my opinion, although the authors mention some related works on distillation for diffusion models [1, 2], these works deserve a much more detailed discussion given their strong relevance — or even the use of essentially identical distillation loss functions — to the method proposed in this paper. For example, the high-level objective function of this paper is essentially the Fisher divergence used in [1], and the tractable training objective derived here is mathematically equivalent to those in [1, 2] (e.g., the loss function in this paper corresponds to the SiD loss in [1], for a particular value of its weighting coefficient, and to the combined loss in [2]). The derivation should be essentially the same as what has been done in [2] (same high-level objective and same final loss function). Although the focus shifts from diffusion models to diffusion bridges, and the motivation for adopting a Fisher divergence-like objective is different (which I think is a positive contribution if you can elaborate more on the connection between the KL for path measures and the Fisher divergence), I still believe that better contextualization of these related works is necessary and would greatly benefit the community's understanding of this research direction.
- I feel the claim that "previously proposed samplers and consistency models can not work with unconditional bridges" needs more justification than provided and should be re-examined. For example, [3] develops DDIM-like samplers for I2SB, which is an unconditional bridge. For consistency models, one only requires the ability to simulate the PF-ODE, which I believe is plausible once the drift of the unconditional bridge is known, given that it is a Markovian process.
I am happy to raise my score if these are properly addressed.
[1] Zhou, Mingyuan, et al. "Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation." (ICML 2024).
[2] Huang, Zemin, et al. "Flow generator matching." arXiv:2410.19310 (2024).
[3] Wang, Yuang, et al. "Implicit Image-to-Image Schrodinger Bridge for Image Restoration." arXiv:2403.06069 (2024).
Other Comments or Suggestions
N/A
Dear Reviewer kKRF, thank you for your comments.
(1) Relation to SiD [1] and FGM [2]. Following your suggestion, we will extend the discussion of the related work in the main text. Below, we present the preliminary version of the extension:
Unlike SiD [1] and FGM [2], we focus on diffusion-bridge models used for data-to-data translation rather than generation from noise. Furthermore, our high-level objective is the KL divergence between the path measure of the teacher model and the path measure given by the generator $G$, which differs from the Fisher divergence used in SiD [1]. The motivation for our high-level objective is to restore alignment between data pairs $(x_0, x_T)$. We derive our final tractable objective using different techniques (see Appendix A), i.e., we do not use the flow-product or score-product identities as in [1, 2] but hypothesize that analogous identities can be derived for diffusion bridge models. Our final objective can be rewritten in a form similar to the final objectives used in SiD [1, Eq. 23] and FGM [2, Eqs. 4.11, 4.12], both of which contain a cross term and a quadratic term.
However, in both SiD [1] (for the coefficient value used in their experiments) and FGM [2], the authors either omitted the quadratic term or even used a negative coefficient for it in image experiments, since it introduced instability. Unlike SiD [1] and FGM [2], we do not omit any part of the original loss function and use it exactly as the theory provides.
Relation between KL divergence of path measures and Fisher divergence. To highlight the difference between the KL divergence of path measures (which we use) and the Fisher divergence (which is used in SiD [1]), consider two reverse-time diffusions $P$ and $Q$ given by the same starting distribution and SDEs:

$$dx_t = b^{P}(x_t, t)\,dt + \varepsilon\, d\overline{W}_t, \qquad dx_t = b^{Q}(x_t, t)\,dt + \varepsilon\, d\overline{W}_t.$$

Denote the marginal densities as $p_t$ for $P$ and $q_t$ for $Q$; then the KL divergence (via the Girsanov theorem) and the Fisher divergence are given as:

$$\mathrm{KL}(P\,\|\,Q) = \frac{1}{2\varepsilon^{2}}\int_{0}^{T} \mathbb{E}_{x_t \sim p_t}\big\|b^{P}(x_t, t) - b^{Q}(x_t, t)\big\|^{2}\,dt, \qquad \mathrm{FD}(p_t\,\|\,q_t) = \mathbb{E}_{x_t \sim p_t}\big\|\nabla_{x_t}\log p_t(x_t) - \nabla_{x_t}\log q_t(x_t)\big\|^{2}.$$
Note that in SiD [1], the authors use the time average of the Fisher divergence, which compares only the marginal distributions $p_t$ and $q_t$. However, two path measures with the same marginal distributions might not be equal. As a result, minimizing the Fisher divergence with respect to the teacher does not guarantee that the learned model will transform data in the same way as the teacher model. Nevertheless, the Fisher divergence allows one to build a generator producing data whose marginals match those of the real data. In contrast, we use the KL divergence between two path measures, which is zero if and only if the two path measures are identical, since we need generations aligned to the data coupling.
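To see that equal marginals do not imply equal path measures, consider a standard illustrative example (our own, not from the paper): in $\mathbb{R}^2$, both processes below keep the stationary marginal $\mathcal{N}(0, I)$ for all $t$, yet for $\omega \neq 0$ their path measures (and hence their couplings of start and end points) differ, while the Fisher divergence between their marginals is identically zero:

$$dx_t = -x_t\,dt + \sqrt{2}\,dW_t, \qquad dx_t = (-x_t + \omega J x_t)\,dt + \sqrt{2}\,dW_t, \qquad J = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}, \quad x_0 \sim \mathcal{N}(0, I).$$

The rotational drift $\omega J x_t$ is divergence-free and orthogonal to $\nabla \log \mathcal{N}(0, I)$, so it leaves every marginal unchanged while altering the dynamics.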
(2) Acceleration of the unconditional diffusion bridge models.
To obtain the PF-ODE for a diffusion bridge model, one needs to subtract the forward and reverse SDE drifts [4, end of Sec. 4]. In the conditional case, the drift of the forward process is known analytically from the Doob h-transform. In the unconditional case, the drift of the forward unconditional diffusion is unknown. Hence, to restore the PF-ODE in the unconditional case, one must learn an additional teacher model for forward-time translation, which is time-consuming.
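To make this concrete, here is a standard sketch in generic notation (ours, not verbatim from [4]): if the forward SDE has drift $b^{\mathrm{fwd}}$ and diffusion coefficient $g(t)$, and the reverse SDE has drift $b^{\mathrm{bwd}}$, then

$$b^{\mathrm{fwd}}(x_t, t) - b^{\mathrm{bwd}}(x_t, t) = g^{2}(t)\,\nabla_{x_t}\log p_t(x_t), \qquad \frac{dx_t}{dt} = b^{\mathrm{fwd}}(x_t, t) - \tfrac{1}{2}\,g^{2}(t)\,\nabla_{x_t}\log p_t(x_t),$$

so recovering either the score (via the difference of the drifts) or the PF-ODE itself requires knowing both drifts, while the unconditional bridge teacher provides only the reverse one.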
We will add that I3SB [3] can be used to accelerate the unconditional bridge model. Their sampler coincides with the DBIM sampler but replaces the conditional model with an unconditional one to sample $x_0$. Their sampler provides good quality for moderate NFE (25+) but performs worse than distillation methods in the single-NFE regime, e.g., for JPEG-10, their FID is 17, while ours is 3.8 (we will add this to the text).
Concluding remarks. We would be grateful if you could let us know if our explanations have been satisfactory. If so, we kindly ask that you consider increasing your rating. We are also open to discussing any other questions you may have.
References:
[1, 2, 3]: same as above.
[4] Shi, Y., et al. "Diffusion Schrödinger Bridge Matching" (NeurIPS 2023).
I thank the authors for their rebuttal. Please make sure to add the discussion to the main text.
I have a few follow-up comments:
- I agree with your arguments about the difference between KL w.r.t. path measures and Fisher divergence, but I also think that their connection under the setting of diffusion models/diffusion bridges should be explicitly discussed. For diffusion models, the reverse drift is fully characterized by the marginal score function $\nabla_{x_t} \log p_t(x_t)$, and thus they are equivalent. For diffusion bridges, one can draw similar connections, as the optimal drift is also given by a conditional score $\nabla_{x_t} \log p_t(x_t \mid x_T)$. The only difference I have seen here is that this conditional score is no longer equal to the marginal score as in the case of diffusion models. Given that, the derivation of IBMD wouldn't be fundamentally different from the diffusion model case.
- Besides, I am hesitant to accept the argument "We derive our final tractable objective using different techniques (see Appendix A), i.e., we do not use flow product or score product identities as in [1, 2]". To me, the core technique used for the derivations in Appendix A is also the score identity used in [1, 2], i.e., $\nabla_{x_t} \log p_{\theta}(x_t) = \mathbb{E}_{p_{\theta}(x_0 \mid x_t)}\left[\nabla_{x_t} \log p(x_t \mid x_0)\right]$.
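For reference, this identity follows in one line from the mixture representation $p_{\theta}(x_t) = \int p(x_t \mid x_0)\, p_{\theta}(x_0)\, dx_0$ (a standard derivation, spelled out here for convenience):

$$\nabla_{x_t}\log p_{\theta}(x_t) = \frac{\int \nabla_{x_t} p(x_t \mid x_0)\,p_{\theta}(x_0)\,dx_0}{p_{\theta}(x_t)} = \int \frac{p(x_t \mid x_0)\,p_{\theta}(x_0)}{p_{\theta}(x_t)}\,\nabla_{x_t}\log p(x_t \mid x_0)\,dx_0 = \mathbb{E}_{p_{\theta}(x_0 \mid x_t)}\big[\nabla_{x_t}\log p(x_t \mid x_0)\big].$$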
In conclusion, my point is that what matters here is the connection (or the technical equivalence) between them under this specific setup, rather than their difference, which is valid in general but does not apply much in this setup. That said, I appreciate your arguments about the differences between them, and I think this is a nice contribution to the paper.
Nevertheless, I have adjusted my rating, and I encourage the authors to discuss the relationship between their methods and the mentioned score distillation techniques for diffusion models more thoroughly in the revised version.
(minor)
- I agree with your argument that learning the PF-ODE characterizing the marginal distribution of the unconditional bridge is demanding. But I am wondering whether, for generation purposes, one may establish the ODE w.r.t. the conditional distribution of the unconditional bridge. For example, the deterministic sampler in [3] should also correspond to an ODE trajectory?
Thank you for your valuable feedback.
(1) New version of the relation.
We extended the part about the relation between KL divergence of path measures and Fisher Divergence based on your feedback. We included more details specific to the diffusion and diffusion bridge models.
Relation between KL divergence of path measures and Fisher divergence. To highlight the difference between the KL divergence of path measures (which we use) and the Fisher divergence (which is used in SiD [1]), consider two reverse-time diffusions $P$ and $Q$ given by the same starting distribution and SDEs:

$$dx_t = b^{P}(x_t, t)\,dt + \varepsilon\, d\overline{W}_t, \qquad dx_t = b^{Q}(x_t, t)\,dt + \varepsilon\, d\overline{W}_t.$$

Let $p_t$ and $q_t$ be the corresponding marginals. Then the KL divergence (via the Girsanov theorem) and the Fisher divergence are given by:

$$\mathrm{KL}(P\,\|\,Q) = \frac{1}{2\varepsilon^{2}}\int_{0}^{T} \mathbb{E}_{x_t \sim p_t}\big\|b^{P}(x_t, t) - b^{Q}(x_t, t)\big\|^{2}\,dt, \qquad \mathrm{FD}(p_t\,\|\,q_t) = \mathbb{E}_{x_t \sim p_t}\big\|\nabla_{x_t}\log p_t(x_t) - \nabla_{x_t}\log q_t(x_t)\big\|^{2}.$$
In SiD [1], the Fisher divergence is averaged over time and compares only the marginal distributions $p_t$ and $q_t$ of the two path measures. However, two path measures with the same marginal distributions might not be equal — thus, in general, minimizing the Fisher divergence does not guarantee that $P = Q$ as stochastic processes. In the case of classical diffusion models, where the forward drift is fixed, the reverse drifts are fully determined by the marginal score functions:

$$b^{P}(x_t, t) = f(x_t, t) - g^{2}(t)\,\nabla_{x_t}\log p_t(x_t), \qquad b^{Q}(x_t, t) = f(x_t, t) - g^{2}(t)\,\nabla_{x_t}\log q_t(x_t),$$

where $dx_t = f(x_t, t)\,dt + g(t)\,dW_t$ is the shared fixed forward SDE.
Substituting these into the KL expression shows that in this specific setting — with a fixed forward SDE — KL divergence between path measures becomes equivalent (up to a constant) to the time-averaged Fisher divergence between the marginals. This explains why Fisher-based methods like SiD [1] may succeed in this context.
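Explicitly, substituting these drifts into the Girsanov expression above (taking $\varepsilon = g(t)$, i.e., allowing a time-dependent diffusion coefficient) gives

$$\mathrm{KL}(P\,\|\,Q) = \frac{1}{2}\int_{0}^{T} g^{2}(t)\,\mathbb{E}_{x_t \sim p_t}\big\|\nabla_{x_t}\log p_t(x_t) - \nabla_{x_t}\log q_t(x_t)\big\|^{2}\,dt = \frac{1}{2}\int_{0}^{T} g^{2}(t)\,\mathrm{FD}(p_t\,\|\,q_t)\,dt,$$

i.e., a $g^{2}(t)$-weighted time average of the Fisher divergence between the marginals.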
However, this equivalence breaks down in the case of unconditional bridge matching. Here, the forward drift is not fixed and depends on the data coupling $p(x_0, x_T)$. In turn, the forward drift for the generated coupling $p_{\theta}(x_0, x_T)$ also depends on $\theta$. As a result, the two processes no longer share a common forward drift, and the reverse drifts cannot be expressed solely in terms of marginal scores. Hence, the KL divergence in the case of unconditional bridge matching is not equivalent to the Fisher divergence between marginals. This difference is expected since, in the case of an unconditional diffusion bridge, one does not have a fixed forward process, which specifies the "dynamic part" of the measure. This highlights the importance of using the KL divergence between path measures as a high-level objective instead of the previously used Fisher divergence.
(2) Regarding the score and flow product identities. We agree that the property we use is similar to the score identity. We will remove this sentence in the extension of the related work.
(3) PF-ODE. In I3SB [3], the authors state in Theorem 1 that their sampler coincides with the PF-ODE of the Variance Exploding (VE) fixed Schrödinger Bridge (SB). By "fixed," the authors assume the SB between some distribution and a fixed endpoint $x_T$. This SB coincides with the forward diffusion given by the Doob h-transform for the VE SDE, i.e., the one considered in the DDBM/DBIM papers. This follows from the result that the Schrödinger Bridge for the VE SDE is the unique process that is Markovian and is a mixture of VE SDE bridges; both conditions are satisfied since the authors of DDBM/DBIM use the Doob h-transform for the VE SDE. This result for a general Markovian SDE (not only the VE SDE) can be found in [5, Theorem 2.12].
In Theorem 1, the authors of I3SB omitted that the denoiser should also depend on $x_T$, i.e., one should use $\hat{x}_0(x_t, t, x_T)$, since this PF-ODE is derived for the SB with a fixed $x_T$. If we use this PF-ODE in the general case of a bridge diffusion started from $x_T \sim p_T$, but with $\hat{x}_0$ approximated by the unconditional model, then we will indeed obtain some ODE trajectories, but we do not know of any theoretical guarantees on what this ODE will produce.
Concluding remarks. We would be grateful if you could let us know if our explanations have been satisfactory. If so, we kindly ask that you consider increasing your rating. We are also open to discussing any other questions you may have.
New reference:
[5] Léonard, C. (2013). A survey of the Schrödinger problem and some of its connections with optimal transport. arXiv preprint arXiv:1308.0215.
The paper introduces IBMD, an inverse bridge matching distillation method for inverse problems. The key idea is to treat bridge matching distillation as an inverse problem and convert the constrained problem into an unconstrained one using the reparameterization trick. Based on the teacher models DDBM and I2SB, models distilled with IBMD achieve low FID with fewer NFEs on super-resolution, image restoration, image inpainting, and image-to-image translation compared to other baselines.
Update after rebuttal
I have no concerns regarding the submission, therefore, I will maintain my original rating of 4.
Questions for Authors
How much time does it take to distill each model, I2SB and DDBM?
Claims and Evidence
- The proposed distillation method reduces NFEs in inverse problems while preserving the teacher model's performance. As far as I know, this is the first distillation approach for both unconditional and conditional inverse problems. The equations and derivations are solid.
- It would be better to clarify the reasoning behind the statement in Section 3.2, line 240: 'The key difference in the reformulated problem is that it admits clear gradients of the generator $G$.' For example, explaining that all parts are differentiable would help.
Methods and Evaluation Criteria
- The authors follow the typical setup for inverse problems (super-resolution, image restoration, translation, and inpainting) and select appropriate teacher models (unconditional for I2SB and conditional for DDBM) along with suitable baselines.
- One curious point is that multi-step distillation is applied to the distillation of DDBM but not to CBD and CBT. What if single-step distillation is applied to the proposed approach? How would the results change in the metrics?
Theoretical Claims
I have checked the propositions and Theorems 3.1–3.4, along with their derivations in the Appendix. The proofs seem correct.
Experimental Design and Analysis
The experimental designs are valid.
Supplementary Material
I have checked the Appendix material and the code in the supplementary material.
Relation to Broader Literature
Due to the reduced NFEs achieved by the proposed method, it will be more applicable to real-world inverse problem applications.
Missing Essential References
No.
Other Strengths and Weaknesses
Strengths
- Treating bridge matching distillation as an inverse problem is novel.
- The derivation for reparameterization is solid.
- It enables one-step inference.
Weaknesses
- The qualitative results for inpainting are not satisfactory.
Other Comments or Suggestions
It would be better to use the same notation for 'single-step' and 'multi-step' in line 99.
Dear Reviewer bxNd, thank you for your comments. Here are the answers to your questions and comments.
(1) It would be better to clarify the reasoning behind the statement in Section 3.2, line 240: 'The key difference in the reformulated problem is that it admits clear gradients of the generator $G$.' For example, explaining that all parts are differentiable would help.
Thank you for this suggestion. In the final version, we will clarify the differentiability of all parts of the final objective.
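To illustrate the point informally, below is a minimal schematic of the bilevel update in PyTorch-style Python. It is only a sketch under our own assumptions: the names (`G`, `teacher`, `phi`, `sample_bridge_posterior`) and the DMD/SiD-style surrogate generator loss are ours and do not reproduce the paper's exact objective; the sketch only shows that every operation from the input noise to the generator loss is differentiable w.r.t. $G$'s parameters.

```python
import torch

# Schematic of one IBMD-style bilevel iteration (illustrative assumptions only).

def sample_bridge_posterior(x0, xT, t, sigma=1.0):
    # Brownian-bridge-style interpolation between x0 and xT;
    # reparameterized (pathwise), hence differentiable w.r.t. x0.
    noise = torch.randn_like(x0)
    return (1 - t) * x0 + t * xT + sigma * (t * (1 - t)).sqrt() * noise

def phi_step(phi, G, xT, opt_phi):
    # Inner problem: fit the auxiliary bridge model phi on the generator's coupling.
    t = torch.rand(xT.shape[0], 1)
    with torch.no_grad():                      # phi treats G's samples as data
        x0_fake = G(torch.randn_like(xT), xT)
    xt = sample_bridge_posterior(x0_fake, xT, t)
    loss_phi = (phi(xt, t, xT) - x0_fake).pow(2).mean()
    opt_phi.zero_grad(); loss_phi.backward(); opt_phi.step()

def G_step(teacher, phi, G, xT, opt_G):
    # Outer problem: every operation from the noise to the loss is
    # differentiable in G's parameters, since x_t is sampled pathwise.
    t = torch.rand(xT.shape[0], 1)
    x0_fake = G(torch.randn_like(xT), xT)
    xt = sample_bridge_posterior(x0_fake, xT, t)
    # Surrogate: the teacher-vs-phi discrepancy steers G. The paper's exact
    # objective differs; this only demonstrates the gradient flow.
    grad_dir = (teacher(xt, t, xT) - phi(xt, t, xT)).detach()
    loss_G = (grad_dir * xt).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```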
(2) One curious point is that multi-step distillation is applied to the distillation of DDBM but not to CBD and CBT. What if single-step distillation is applied to the proposed approach? How would the results change in the metrics?
If we understand correctly, you are asking about single-step distillation applied to the result of multi-step distillation. Indeed, we applied multi-step distillation to the original teacher models, e.g., I2SB and DDBM, but not to already distilled models like CBD or CBT. We did so since our method is designed for the distillation of diffusion-bridge models (like I2SB and DDBM), while CBD and CBT are consistency models obtained from DDBM. The result of multi-step IBMD (ours) distillation is also not a diffusion-bridge model. For this reason, we do not consider one-step distillation of multi-step distilled models.
(3) It would be better to use the same notation for 'single-step' and 'multi-step' in line 99.
Thank you for this suggestion. We will change it in the final version.
(4) How much time does it take to distill each model, I2SB and DDBM?
We present the training time of each model below:
| Task | Teacher | Dataset | Approximate time on 8×A100 | NFE |
|---|---|---|---|---|
| super-resolution (bicubic) | I2SB | ImageNet | 40 hours | 1 |
| super-resolution (pool) | I2SB | ImageNet | 40 hours | 1 |
| JPEG restoration, QF | I2SB | ImageNet | 24 hours | 1 |
| JPEG restoration, QF | I2SB | ImageNet | 40 hours | 1 |
| Center-inpainting | I2SB | ImageNet | 24 hours | 4 |
| Center-inpainting | DDBM | ImageNet | 12 hours | 4 |
| Sketch to Image | DDBM | Edges/Handbags | 40 hours | 1 |
| Sketch to Image | DDBM | Edges/Handbags | 1 hour | 2 |
| Normal to Image | DDBM | DIODE-Outdoor | 48 hours | 1 |
| Normal to Image | DDBM | DIODE-Outdoor | 7 hours | 2 |
About 75% of this training time is spent obtaining the last 10–20% decrease in FID (e.g., the drop from 3.6 to 2.5 FID in the pooling SR setup, or from 4.3 to 3.8 FID in JPEG restoration with QF=10), while training for the first 25% of the time already provides a good-quality model. On Sketch-to-Image and Normal-to-Image in the multistep regime with 2 NFEs, convergence appears faster than in the corresponding single-step version. We will add the approximate training times to Table 7 of Appendix B (the table with all hyperparameters).
Concluding remarks. We would be grateful if you could let us know if the explanations we gave have been satisfactory in addressing your concerns and questions about our work. We are also open to discussing any other questions you may have.
Thank you for your rebuttal. I have read it carefully, along with the other reviews.
I have no concerns regarding the submission, therefore, I will maintain my original rating of 4.
This paper proposes a new distillation scheme for diffusion bridge models. The main idea is to parameterize the entire formulation via a stochastic generator G. The student must follow input-output pairs produced by G, which constrains the path of the diffusion bridge to coincide with the teacher's path. By optimizing G, one can find a viable student. To make the formulation tractable, the constrained problem is reformulated into an unconstrained one, resulting in a bilevel optimization problem that is somewhat GAN-like. The generator G becomes the resulting one-step generator, and the difference between the teacher error and the "student error" acts as a discriminator. G can also be designed in a multi-step fashion. The entire formulation applies to both unconditional and conditional bridge matching. Experiments show that the proposed method provides state-of-the-art results with far fewer steps than existing methods.
Update after rebuttal
All the reviewers have rated the paper positively. The additional results provided in the rebuttal are also convincing. I maintain my original score.
Questions for Authors
The main question I have is about the meaning of the bilevel formulation. In my understanding, the second term with \phi acts as an "expander," meaning that G could become trivial (or collapse) without this term. Initially, I thought that \phi is the student and would be the resulting model. However, after reading the whole paper, I realized that G is the final model, and even though \phi looks like a student, it is actually an auxiliary component. In other words, the proposed method trains another DBM just to help train G. (This makes the whole formulation quite heavy, because we now have two additional models, G and \phi. How long does training take?)
I'd like to ask whether this interpretation is right, and I'd like to see a deeper discussion in the paper regarding the role of \phi. Currently, there is not much discussion about the meaning of the final formulation.
Claims and Evidence
The proposed formulation based on the stochastic generator G, as well as the bilevel reformulation, is sound and novel. I checked the proofs, and they are correct in my opinion.
Methods and Evaluation Criteria
As mentioned above, the method and the proofs are sound. The method was evaluated on popular benchmark datasets.
Theoretical Claims
As mentioned above, the proofs ((a) the parametrized matching problem becomes constrained optimization (9), (b) it can be reformulated as an unconstrained bilevel optimization, and (c) it can be reparametrized based on the denoisers/samplers) are sound.
Experimental Design and Analysis
The method was evaluated on popular benchmark datasets, and the experiments show most important metrics (NFE, FID, CA). The proposed method shows state-of-the-art performance with much fewer sampling steps.
Supplementary Material
I focused on the proof part. I briefly checked the rest.
Relation to Broader Literature
Diffusion bridge models are an important topic in diffusion models, and they can be used in many data-to-data translation problems. Providing a faster sampling method for diffusion bridge models can benefit many related areas.
Missing Essential References
I believe the bibliography is thorough.
Other Strengths and Weaknesses
The proposed distillation technique based on the G formulation is sound and novel. This particular formulation allows for handling both unconditional and conditional diffusion bridge models. One downside is that the learning procedure can be quite complicated and heavy, as also pointed out in the Discussion section.
Other Comments or Suggestions
Please see the question below.
Dear Reviewer AN8r, thank you for your comments. Here are the answers to your questions and comments.
(1) In my understanding, the second term with \phi here acts as an "expander," which means G can become trivial (or can collapse) without this term ... In other words, the proposed method trains another DBM just to help train G. I'd like to ask whether this interpretation is right, and I'd like to see a deeper discussion in the paper regarding the role of \phi.
Yes, this interpretation is correct. To show it more formally, note that the minimal value of the inner problem is the averaged variance of $p_{\theta}(x_0 \mid x_t, x_T)$:

$$\mathbb{E}_{p_{\theta}(t, x_t, x_T)}\Big[\lambda(t)\,\underbrace{\mathbb{E}_{p_{\theta}(x_0 \mid x_t, x_T)}\big[\|\mathbb{E}_{p_{\theta}(x_0 \mid x_t, x_T)}[x_0] - x_0\|^{2}\big]}_{\text{Variance of } p_{\theta}(x_0 \mid x_t, x_T)}\Big].$$

For $t = T$, this is directly the variance of the generator $G$. Since we are maximizing this part over $\theta$, it enforces the generator to produce more diverse outputs and avoid collapsing.
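For completeness, the standard bias-variance step behind this claim (written in our notation): for fixed $(t, x_t, x_T)$, the inner minimization over the prediction $\hat{x}$ satisfies

$$\min_{\hat{x}}\ \mathbb{E}_{p_{\theta}(x_0 \mid x_t, x_T)}\big[\|\hat{x} - x_0\|^{2}\big] = \mathbb{E}_{p_{\theta}(x_0 \mid x_t, x_T)}\big[\|\mathbb{E}_{p_{\theta}(x_0 \mid x_t, x_T)}[x_0] - x_0\|^{2}\big],$$

attained at $\hat{x}^{*} = \mathbb{E}_{p_{\theta}(x_0 \mid x_t, x_T)}[x_0]$, i.e., the minimum is exactly the conditional variance.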
Following your advice, we will add this discussion of how the auxiliary model \phi helps train the generator G to the final version of the paper.
(2) How long does it take for training?
We present the training time of each model below:
| Task | Teacher | Dataset | Approximate time on 8×A100 | NFE |
|---|---|---|---|---|
| super-resolution (bicubic) | I2SB | ImageNet | 40 hours | 1 |
| super-resolution (pool) | I2SB | ImageNet | 40 hours | 1 |
| JPEG restoration, QF | I2SB | ImageNet | 24 hours | 1 |
| JPEG restoration, QF | I2SB | ImageNet | 40 hours | 1 |
| Center-inpainting | I2SB | ImageNet | 24 hours | 4 |
| Center-inpainting | DDBM | ImageNet | 12 hours | 4 |
| Sketch to Image | DDBM | Edges/Handbags | 40 hours | 1 |
| Sketch to Image | DDBM | Edges/Handbags | 1 hour | 2 |
| Normal to Image | DDBM | DIODE-Outdoor | 48 hours | 1 |
| Normal to Image | DDBM | DIODE-Outdoor | 7 hours | 2 |
About 75% of this training time is spent obtaining the last 10–20% decrease in FID (e.g., the drop from 3.6 to 2.5 FID in the pooling SR setup, or from 4.3 to 3.8 FID in JPEG restoration with QF=10), while training for the first 25% of the time already provides a good-quality model. On Sketch-to-Image and Normal-to-Image in the multistep regime with 2 NFEs, convergence appears faster than in the corresponding single-step version. We will add the approximate training times to Table 7 of Appendix B (the table with all hyperparameters).
Concluding remarks. We would be grateful if you could let us know if the explanations we gave have been satisfactory in addressing your concerns and questions about our work. We are also open to discussing any other questions you may have.
This paper focuses on improving the inference efficiency of diffusion bridge models, enabling single-step generation for both conditional and unconditional tasks. All reviewers recommend acceptance of this work, highlighting the novelty of the proposed method and its evaluation across a wide range of experimental settings. AC recommends acceptance. The authors are encouraged to include the reviewer discussions in the final paper.