On the Guidance of Flow Matching
We introduce the first framework for general flow matching guidance, from which new guidance methods are derived and many classical guidance methods are covered as special cases.
Abstract
Reviews and Discussion
In this paper, the authors provide a general flow matching framework for conditional generation based on an energy function. They propose a unifying framework that constructs guidance for arbitrary source distributions and couplings. They derive a guidance method based on MC estimation, then approximate it with a Taylor expansion, simplify it further under affine path assumptions, and finally assume the stronger setting of uncoupled affine Gaussian paths. Each step yields one guidance method, all of which are theoretically justified and then empirically analyzed. The authors also provide derivative-free guidance and guidance for inverse problems. Finally, they give training-based exact guidance methods. These are all examined on toy examples, RL, and image problems.
Questions for Authors
If the authors could address my concerns above, I would be willing to increase my score.
Claims and Evidence
Upon reviewing this paper, I have identified several claims that appear to require further clarification or support. Firstly, the title and abstract may be perceived as somewhat misleading, as they refer to "Guidances of Flow matching" without explicitly indicating that the primary focus is on energy-guided sampling. To avoid potential confusion, I think the authors should change the title and abstract to emphasize this key aspect of the work.
In Theorem 3.1, the authors assume that \mathcal{P} equals or can be approximated as 1. They then say that this is reasonable for independent couplings or for mini-batch OT flow matching with a small batch size. Doesn't OT with a sufficiently small batch size behave almost like independent coupling? I point this out because later, at the beginning of page 4, the authors put a lot of emphasis on "guidance substantially different from diffusion one" and give dependent coupling as one of the examples, which does not seem convincing given this assumption on \mathcal{P}.
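To make the batch-size question concrete, here is a minimal sketch of mini-batch OT pairing, assuming a squared-Euclidean cost and a brute-force search over permutations (practical only for tiny batches; real implementations would use an OT solver). The function name `minibatch_ot_pairs` is hypothetical and not taken from the paper.

```python
import itertools


def minibatch_ot_pairs(x0_batch, x1_batch):
    """Pair source and data samples within one batch by minimizing total squared distance.

    Brute force over permutations; illustrative only, assumes equal batch sizes.
    """
    n = len(x0_batch)
    best_perm, best_cost = None, float("inf")
    for perm in itertools.permutations(range(n)):
        cost = sum(
            sum((a - b) ** 2 for a, b in zip(x0_batch[i], x1_batch[j]))
            for i, j in enumerate(perm)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return [(x0_batch[i], x1_batch[j]) for i, j in enumerate(best_perm)]


# 1-D toy batch: each source point is matched to its nearest data point.
x0 = [(0.0,), (1.0,), (2.0,)]
x1 = [(2.1,), (0.1,), (1.1,)]
pairs = minibatch_ot_pairs(x0, x1)
```

With a batch of size 1 there is only one possible pairing, so the resulting coupling coincides with the independent one; the dependence between paired samples only appears as the batch grows.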
Furthermore, I question the claim that experiments on synthetic datasets demonstrate the effectiveness and correctness of the proposed guidance methods. While some methods do perform well, others struggle to replicate simple base distributions, as illustrated in Figure 2. Specifically, only two methods (g^MC and g^\phi) achieve satisfactory results; however, the Monte Carlo method requires a nested MC estimation, which may not be practical. Similar limitations are observed in the image experiments, where the authors acknowledge that g^MC does not perform well due to high complexity and sample budget constraints.
Regarding classifier-free guidance, the authors mention its relevance but do not provide sufficient context or explanation. According to existing literature (e.g., Lipman et al.'s "Flow Matching Guide and Code"), classifier-free guidance can be readily applied using specific transformations between velocity fields and score functions, even though the theoretical derivations in their paper that motivate CFG are limited to Gaussian paths. The authors briefly mention CFG through "These losses open the design space of the training loss for classifier-free guidance of flow matching (Ho & Salimans, 2022; Zhang et al., 2023).", but it is unclear what the authors mean by this. I recommend rephrasing or removing the sentence.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria appear to be well-reasoned and sound. However, I have some questions regarding the results presented in Figure 2. Specifically, it is intriguing that only two of the proposed methods demonstrate satisfactory performance on several simple and toy examples. This raises concerns about the generalizability and robustness of the other methods.
What is really surprising is that some of these methods seem to exhibit significant improvements when applied to more complex problems. It would be helpful if the authors could provide further insight into this, as it may indicate that there are specific conditions or characteristics of the problems that favor certain methods over others.
Theoretical Claims
I did not go through the proofs in the supplementary material but the theoretical claims in the main paper seem correct. However, the calculation of g_t^local should be referenced in the main paper as it is not clear how the formulation follows directly. If I am not mistaken, this is calculated in line 1080 in the appendix? If so, shouldn't the first line of Eqn (68) say g^local rather than g^cov?
Experimental Designs or Analyses
Please see above.
Supplementary Material
I did not review the supplementary material in detail.
Relation to Broader Scientific Literature
The contribution seems timely and relevant, particularly considering the importance of and need for guidance methods in flow matching.
Essential References Not Discussed
I am not aware of any relevant works that exist but have not been cited.
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
- Figure 1 is not completely clear. It would be nice if there were an explanation of it, at least in the supplementary material.
- Shouldn't the first sentence of Section 2 (Background) say data samples x_1 rather than x_t?
- Furthermore, at line 060, it should say "it has been proven" rather than "it has been proved".
- In lines 154-156, it is stated that "the new VF has the conditional probability path as that of the original", but it is unclear why. The authors should elaborate further.
Thank you for your detailed review and for acknowledging the contribution and potential impact of our work. We will address your concerns in the following:
Q1: Misleading expression of "guidance"
We apologize for the confusion. We will switch to 'energy guidance' to further enhance clarity.
Q2: The assumption of \mathcal{P}=1 seems to contradict the claim in the paper that the guidance is substantially different from diffusion guidance.
- From the empirical perspective, we argue that \mathcal{P} \approx 1 is a valid choice for dependent couplings in realistic datasets. As we show in the table below, the learned VFs of OT-CFM and uncoupled CFM on CelebA have a small relative error at all flow time steps, so their guidance VFs are approximately the same, validating our approximation of \mathcal{P} \approx 1.
| Flow time | 0.05 | 0.25 | 0.5 | 0.75 | 0.95 |
|---|---|---|---|---|---|
| Relative L2 | | | | | |
- Theoretically, with a slow-varying J, the approximation \mathcal{P} \approx 1 holds for any coupling. A detailed discussion can be found in our response to reviewer YEWM (Q2).
- We would like to emphasize that, in addition to dependent couplings, our framework is also substantially different from diffusion guidance because (a) it extends to any source distribution and conditional probability path, as also recognized by reviewers erey and reFm, and (b) a different theoretical framework is provided, as our derivations do not start from score-based models.
Q3. The Monte Carlo guidance has the limitation of lower sample efficiency, so is it practical?
- We argue that g^MC is practical (at least in some tasks), as it shows satisfactory performance in the widely used offline RL task of D4RL. The dimensionality of this task is already high enough for many practical generative modeling tasks, including generative modeling and molecular structure generation.
- Besides, there are many variance-reducing MC techniques, such as importance sampling. Specifically, we can use another guided VF to sample from an alternative distribution such that the estimator has lower variance. For the details of the new estimator, please refer to our response to reviewer YEWM (Q1). It should be noted that this estimation is still unbiased. We conducted experiments to validate the effectiveness of this method, and the results in the table below show a large improvement, comparable to the state-of-the-art methods on the image inverse problem, where the vanilla g^MC fails.
| Method | FID | LPIPS | PSNR | SSIM |
|---|---|---|---|---|
| | 7.863 | 0.1889 | 23.63 | 0.8429 |
| | 22.75 | 0.5589 | 8.67 | 0.3484 |
| | 15.27 | 0.1753 | 25.48 | 0.8700 |
Q4. Why do many methods perform poorly in toy examples?
Actually, the toy examples are not trivial: sampling from the target distribution requires theoretically exact guidance. Otherwise, the samples will be distorted or exhibit mode collapse. Gradient-based guidance methods rely on rough approximations, while approximate guidance methods (e.g., CEG) produce biased guidance vector fields, leading to distorted sample distributions.
Q5. Why do these methods exhibit significant improvements when applied to more complex problems?
Not all tasks require exact energy-guided sampling as in the toy example. There are tasks where sampling from a distorted, rather than exact, target distribution may be practically useful, as long as both and are high. The image inverse problem in our experiment is one instance where the approximate methods demonstrate improved performance. On the contrary, the offline RL task is conditioned on different initial conditions, necessitating stable guidance, so theoretically exact energy guidance is more robust.
Q6. More explanation is required for classifier-free guidance in the current framework (end of Section 3.5, lines 347-349)
In our framework, directly following our definition of the general guidance VF, the classifier-free guidance is simply subtracting the original VF from the conditioned VF (which also extends previous CFG to dependent coupling and arbitrary source distributions). However, as you pointed out, our training-based guidance is actually the flow matching version of the diffusion classifier-based guidance, rather than the classifier-free guidance. We will revise this paragraph and move this to the discussion section, as well.
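As a hedged illustration of this reading of classifier-free guidance, the sketch below treats the guidance VF as the difference between a conditioned and an unconditioned velocity, and adds it back with a scale `w`. The scale `w` and the function names are assumptions borrowed from common CFG practice, not definitions from the paper.

```python
def cfg_guidance(v_cond, v_uncond):
    """Guidance VF as the conditioned velocity minus the original velocity."""
    return [c - u for c, u in zip(v_cond, v_uncond)]


def guided_velocity(v_cond, v_uncond, w=1.0):
    """Original velocity plus w times the guidance VF (w is a hypothetical scale)."""
    g = cfg_guidance(v_cond, v_uncond)
    return [u + w * gi for u, gi in zip(v_uncond, g)]


# Toy velocities evaluated at the same (x_t, t).
v_uncond = [0.5, -1.0]
v_cond = [1.0, 0.0]
v_same = guided_velocity(v_cond, v_uncond, w=1.0)
v_orig = guided_velocity(v_cond, v_uncond, w=0.0)
```

In this sketch, `w = 1` recovers the conditioned velocity and `w = 0` recovers the original one; values above 1 extrapolate along the guidance direction.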
Q7. The calculation of g_t^local is unclear.
You are correct. We will add the derivation of g_t^local in the main text in the revised version.
Q8. Other typos.
Thank you. We will fix these. For 4, the conditional probability path is a design choice in flow matching, and we have thus assumed the new conditional probability path to be identical. We will make the assumption more explicit.
Thank you for addressing my concerns. I strongly hope that you will change the name to "energy guidance" to avoid future confusion. I have thus raised my score.
Thank you for your support. We will change the name.
This paper proposes a general framework for guidance in continuous flow matching that includes arbitrary source distributions, conditional paths, and coupling (to some extent). Guidance is viewed as tackling energy-based sampling or posterior sampling given an existing flow matching model. This general theory is shown to cover past instances of flow matching guidance as approximations and the framework leads to several new guidance methods including an asymptotically exact Monte-Carlo version. The paper considers both training-free and training-based guidance, and also introduces several loss functions for training the guidance based on the framework. The methods are then evaluated and compared on synthetic datasets, image inverse problems, and offline RL settings, where performance differences are understood relative to the approximations being made.
update after rebuttal
The rebuttal addressed performance in high dimensions and showed Monte-Carlo guidance could be useful for image inverses. While the training-based losses' performance in practical tasks seems still lacking, I increased my score as the rebuttal increased my confidence that the results are both theoretically and practically valuable.
Questions for Authors
- Do you anticipate that the training-based losses and/or the asymptotically exact Monte-Carlo guidance will be useful in high dimensions? To my understanding, the training-based methods were generally unhelpful outside the synthetic dataset example and were not tried in the image inverse setting. Further, the Monte-Carlo guidance was also not useful for image inverses, as discussed in Appendix C.3.
Claims and Evidence
- Claims are supported throughout. The framework is demonstrated to be useful both for organizing and understanding past research and for understanding relative performance on the experimental tasks.
Methods and Evaluation Criteria
- The choice of methods, datasets, and evaluation criteria is appropriate for analyzing guidance.
Theoretical Claims
- Checked the proofs in the Appendix briefly; they appeared correct, but I did not go through them in detail.
- The guidance in Theorem 3.1 is exact with respect to the true marginal vector field, but not necessarily with respect to a trained marginal vector field. This is not a correctness issue in the proofs, but it could be emphasized more in the main text.
- Similarly, the conditional probability path and conditional VF are assumed the same for the guided distribution, and most of the paper considers independent coupling. While the framework is still quite general, the authors could clarify these limitations further.
Experimental Designs or Analyses
- Experiment details for the RL, image inverse, and synthetic data settings appear sound.
Supplementary Material
- Reviewed Sections A, B, C including the proofs, limitations, and experimental details
Relation to Broader Scientific Literature
Key contributions:
- Framework to understand flow matching guidance covering and organizing many past approaches
- General guidance methods that expand the scope of when guidance can be applied (arbitrary sources, conditional probability paths)
Essential References Not Discussed
None
Other Strengths and Weaknesses
Strengths:
- Generally well-written with ample details provided, Figure 1 is a helpful reference organizing past research on flow matching guidance
- Code to reproduce experiments is provided
- Highly likely to be used as a reference for future research on guidance with flow matching models
Weaknesses:
- Usefulness of the Monte-Carlo guidance and the training-based losses in practical settings is underdeveloped
Other Comments or Suggestions
- Figure 1 contains notation (see general guidance expression) that disagrees with the rest of the text.
Thank you for acknowledging the theoretical and empirical soundness, the presentation, and ample details of our work, as well as the contributions and potential benefits for future works in the field. We would like to address your concerns in the following:
Q1: The correctness of guidance VF assumes the trained VF to be identical to the true VF.
Thank you for raising this question. We indeed made this assumption throughout the paper. We will make it explicit in the revised manuscript.
Q2: There are assumptions on the specific type of guidance VF and coupling, limiting the framework's generality.
Thank you for your comment, and we appreciate your recognition of the overall generality of our guidance. We will add a more detailed discussion on these assumptions we made in the revised version, but would like to make the following clarifications:
For the conditional VF assumption, it is indeed an interesting research question to explore other possible conditional probability paths that may have different advantages, e.g., enhancing "straightness" of the VF for accelerated sampling. Nevertheless, our assumption of conditional VF is natural, and it allows us to simplify the form of guidance, covering many existing guidance methods.
For the assumption of \mathcal{P} \approx 1 in the dependent coupling case:
- Empirically, we argue that \mathcal{P} \approx 1 is a valid approximation for dependent couplings in realistic datasets such as CelebA in our experiment. As we show in the table below, the learned VFs for mini-batch OT CFM (batch size as large as 128) and uncoupled CFM are close at different time steps, with a small relative error, so their guidance VFs are approximately the same, which validates our approximation of \mathcal{P} \approx 1.
| Flow time | 0.05 | 0.25 | 0.5 | 0.75 | 0.95 |
|---|---|---|---|---|---|
| Relative L2 | | | | | |
- Theoretically, the assumption holds for any dependent coupling when J is slow-varying. A more detailed discussion can be found in our response to reviewer YEWM (Q2).
Q3: Inconsistency of the general guidance expression between Figure 1 and the main text.
Thank you for pointing this out. We will fix it in the revised manuscript.
Q4: Are MC and training-based guidance useful in practice? What do you think of the experiment results of these guidance in high dimensions?
- We believe g^MC is of practical use in high dimensions.
First, g^MC is practical (at least in some tasks), as it shows satisfactory performance in the widely used offline RL task of D4RL, where the sample dimensionality is not very small.
Besides, there are many techniques that can be readily applied to enhance the efficiency of g^MC. For example, we can adopt importance sampling to reduce the variance of the MC estimation. Specifically, we can select an alternative sampling distribution such that the importance-weighted estimator has lower variance (i.e., the alternative density is large where the integrand is large, and vice versa). This can be achieved by using another guided VF to sample from the alternative distribution, whose density can then be estimated using, for example, the Hutchinson trace estimator to preserve scalability [1]. It should be noted that this estimation is still unbiased. We conducted experiments to validate the effectiveness of this method, and the results in the table below show a large improvement, comparable to the state-of-the-art methods on the image inverse problem, where the vanilla g^MC fails.
| Method | FID | LPIPS | PSNR | SSIM |
|---|---|---|---|---|
| | 7.863 | 0.1889 | 23.63 | 0.8429 |
| | 22.75 | 0.5589 | 8.67 | 0.3484 |
| | 15.27 | 0.1753 | 25.48 | 0.8700 |
- As for the training-based guidance, it has the potential to be widely applicable because it demonstrates the theoretical exactness in the synthetic dataset experiments. Although currently, its performance is restricted, possibly due to high variance induced by the dependency among multiple neural networks in training and inference, it can be potentially addressed by dynamically estimating rather than training another NN to approximate it, or fine-tuning the guidance VF on the actual original VF.
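For readers unfamiliar with the Hutchinson trace estimator mentioned above, here is a minimal sketch on an explicit matrix, using Rademacher probe vectors. In likelihood computation for flows, the matrix would typically be the Jacobian of the vector field and the matrix-vector product would come from automatic differentiation; the explicit matrix `A` and the name `hutchinson_trace` are simplifying assumptions for illustration.

```python
import random


def hutchinson_trace(matvec, dim, num_probes=1000, seed=0):
    """Estimate tr(A) as the average of z^T A z over Rademacher probes z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        az = matvec(z)
        total += sum(zi * ai for zi, ai in zip(z, az))
    return total / num_probes


# Small symmetric test matrix whose exact trace is 6.0.
A = [[2.0, 0.3, 0.0],
     [0.3, 1.0, -0.2],
     [0.0, -0.2, 3.0]]


def matvec(z):
    """Matrix-vector product A z; the estimator only needs this operation."""
    return [sum(a * zj for a, zj in zip(row, z)) for row in A]


estimate = hutchinson_trace(matvec, dim=3)
```

The estimator only ever calls `matvec`, which is why it scales to cases where forming the full matrix is too expensive.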
[1] Lipman, et al., Flow Matching for Generative Modeling, ICLR 2023.
This paper provides a unified perspective on the guidance of flow matching and proposes a bunch of guidance methods to make it more general. The result shows the relevance between the tasks and guidance methods.
Questions for Authors
Can P be dynamically estimated to relax the P ≈ 1 assumption?
Claims and Evidence
All the claims are clear and convincing.
Methods and Evaluation Criteria
This paper proposes different guidance methods for different tasks.
Theoretical Claims
NA
Experimental Designs or Analyses
NA
Supplementary Material
NA
Relation to Broader Scientific Literature
NA
Essential References Not Discussed
NA
Other Strengths and Weaknesses
Strengths:
- This paper provides a unified framework for flow-matching guidance applicable to various distributions and couplings.
- The training-free (g^MC) and training-based (g_φ) guidance techniques are novel.
- Demonstrates effectiveness across synthetic data, image tasks, and RL, showcasing adaptability.
- Contributions and derivations are well-structured and accessible.
Weakness:
- Does not fully address scenarios where couplings significantly influence outcomes.
Other Comments or Suggestions
Test more tasks to highlight real-world utility.
Thank you for acknowledging our contribution in the theoretical framework, the novelty of guidance methods, the soundness of our empirical validation, and the clarity of presentation. We will respond to your questions below:
Q1: How to further address the scenarios where the \mathcal{P} \approx 1 approximation does not apply?
First, we would like to emphasize that \mathcal{P} \approx 1 is a valid approximation for dependent couplings in realistic datasets such as CelebA in our experiment. As we show in the table below, the learned VFs of mini-batch OT CFM (batch size as large as 128) and uncoupled CFM are close at different flow time steps, with a small relative error, so their guidance VFs are approximately the same, which validates our approximation of \mathcal{P} \approx 1.
| Flow time | 0.05 | 0.25 | 0.5 | 0.75 | 0.95 |
|---|---|---|---|---|---|
| Relative L2 | | | | | |
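The response does not spell out how the relative L2 error is defined. One common convention, assumed here, is the norm of the difference divided by the norm of the reference field, evaluated at the same points and flow time. A minimal sketch with hypothetical names and toy values:

```python
import math


def relative_l2_error(v_test, v_ref):
    """||v_test - v_ref||_2 / ||v_ref||_2 for two flattened velocity outputs."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(v_test, v_ref)))
    ref = math.sqrt(sum(b ** 2 for b in v_ref))
    return diff / ref


# Toy velocities from two hypothetical models at the same (x_t, t).
v_ot = [1.0, 2.0, 2.0]
v_indep = [1.0, 2.0, 2.3]
err = relative_l2_error(v_ot, v_indep)
```

In practice such a metric would be averaged over many samples of x_t at each flow time, which is presumably what the table columns correspond to.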
To produce exact guidance for dependent couplings, one approach is to dynamically adapt the guidance VF as you pointed out. Since changes high-dimensional integrals, it is costly to directly compute its influence. However, if we make assumptions on the form of of the original VF, we can derive the corresponding and approximate guidance. Meanwhile, the approximation error caused by can also be compensated by adapting the source distribution as we discussed in our response to YEWM (Q2). Therefore, we can also parameterize the source distribution and optimize it to recover the exactness of the energy guidance. We will add a discussion on these future directions in the revised manuscript.
Q2: Test more tasks to highlight real-world utility.
Thank you for your suggestions. The experiments in our paper cover data modalities from time-series data to images, and are realistic and high-dimensional. Therefore, we believe our empirical evaluation can effectively reveal the utility of different guidance methods on different types of realistic generative modeling tasks. In addition, we provide an additional experiment on a class-label-conditioned image generation task to increase the variety of guidance energy functions. With the trained CFM model from our image inverse problem experiment, we use a separate gender classifier to produce the objective function, and consider two cases where either male or female is set as the target. The table below shows the classification accuracy of the samples generated by different guidance methods:
| Guidance \ Target label | Male | Female |
|---|---|---|
| | 98% | 85% |
| | 82% | 68% |
These results reveal that on image generation tasks, the performance gap between and is narrowed when the objective function is less complicated.
The paper introduces a framework for guiding flow-matching models, which are advanced generative models. It extends the concept of guidance from diffusion models to a more general form. The framework includes training-free, asymptotically exact guidance using Monte Carlo methods, new training-based guidance losses, and approximate guidance methods that cover classical guidance techniques as special cases. The paper validates these innovations through theoretical analysis and experiments on synthetic datasets, offline reinforcement learning, and image inverse problems, demonstrating improved effectiveness and flexibility.
update after rebuttal
Questions for Authors
- Could the efficiency of the Monte Carlo sampling method be further improved for high-dimensional image tasks?
- What are the major limitations you're currently addressing in P estimation for strong coupling?
Claims and Evidence
The claims in the paper are supported by thorough theoretical derivations and extensive experimental results. The introduction of the Monte Carlo-based training-free guidance is demonstrated to be asymptotically exact, backed by theoretical analysis and pseudocode.
Methods and Evaluation Criteria
Yes. The proposed methods are well-suited for the addressed generative modeling challenges.
Theoretical Claims
Theoretical claims are carefully proven, such as the equivalence of the general flow matching guidance to classical diffusion guidance under certain assumptions.
Experimental Designs or Analyses
Yes. The experimental design is sound, employing appropriate baseline comparisons.
Supplementary Material
Yes. Checked the code link attached to the paper.
Relation to Broader Scientific Literature
The paper builds on existing work in flow matching and diffusion models.
Essential References Not Discussed
No
Other Strengths and Weaknesses
No
Other Comments or Suggestions
No
Thank you for acknowledging the theoretical and empirical soundness of our work, as well as the potential impact in addressing the challenges in the generative modeling field. We answer your questions in the following:
Q1: How can the MC sampling efficiency be improved?
Many existing techniques can be readily applied to enhance the efficiency of g^MC. For example, we can adopt importance sampling to reduce the variance of the MC estimation. Specifically, we can select an alternative sampling distribution such that the importance-weighted estimator has lower variance (i.e., the alternative density is large where the integrand is large, and vice versa). This can be achieved by using another guided VF to sample from the alternative distribution. Its density can then be estimated using, for example, the Hutchinson trace estimator to preserve scalability [1]. It should be noted that this estimation is still unbiased. We conducted experiments to validate the effectiveness of this method, and the results in the table below show performance comparable to the state-of-the-art methods on the image inverse problem, where the vanilla g^MC fails.
| Method | FID | LPIPS | PSNR | SSIM |
|---|---|---|---|---|
| | 7.863 | 0.1889 | 23.63 | 0.8429 |
| | 22.75 | 0.5589 | 8.67 | 0.3484 |
| | 15.27 | 0.1753 | 25.48 | 0.8700 |
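The variance-reduction argument above rests on the standard importance-sampling identity E_p[f(x)] = E_q[f(x) p(x) / q(x)], which stays unbiased for any proposal q that covers the region where p(x) f(x) is nonzero. The sketch below is a generic one-dimensional illustration with Gaussian densities, not the authors' estimator: the integrand `f`, the proposal parameters, and all function names are assumptions, chosen so that the proposal puts mass where the integrand is large.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def f(x):
    # Sharply peaked integrand, playing the role of an energy weight exp(-J(x)).
    return math.exp(-0.5 * ((x - 3.0) / 0.5) ** 2)


def plain_mc(n, rng):
    """Average f over samples from the base density p = N(0, 1)."""
    return sum(f(rng.gauss(0.0, 1.0)) for _ in range(n)) / n


def importance_mc(n, rng, q_mean=2.4, q_std=0.6):
    """Average f * p / q over samples from the proposal q = N(q_mean, q_std^2)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(q_mean, q_std)
        total += f(x) * normal_pdf(x, 0.0, 1.0) / normal_pdf(x, q_mean, q_std)
    return total / n


# Closed form of E_p[f] for this Gaussian pair, used only as a reference value.
exact = (0.5 / math.sqrt(1.25)) * math.exp(-9.0 / (2 * 1.25))
rng = random.Random(0)
est_plain = plain_mc(2000, rng)
est_is = importance_mc(2000, rng)
```

For this toy setup, a closed-form calculation gives a per-sample relative standard deviation of roughly 0.3 for the importance-weighted estimator versus roughly 6 for plain sampling, so both estimators target the same value but the weighted one needs far fewer samples.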
Q2: What are the limitations of our approximation?
First, we would like to clarify our approximation. Our framework allows us to choose any (and hence ) as long as the source distribution is consistent: . In other words, the error induced by setting can be characterized by the deviation in either the source distribution or the vector field. In the former case, we assume the guidance VF to be exact, i.e., . Here the error is induced by the fact that we should have sampled from , rather than the original . In the latter case, we assume the source distribution to be unchanged, i.e., we need to make the source distributions compatible automatically. In this case, the error is caused by approximating with .
- In the case of strongly dependent couplings, \mathcal{P} \approx 1 still holds as long as J varies slowly. This is demonstrated by the small deviation in the error of the source distribution (assuming ) as we discussed above. If J is always near its average value, the new source distribution is almost the original source.
- Nevertheless, when the coupling is strong and J varies intensively, a more complicated treatment is required for exact guidance. For example, we can try to sample from the new source distribution. Although one may argue that this may be as difficult as sampling from the target distribution exactly, it may be learned more easily, as the target distribution is potentially smoothed after being convolved with the "kernel". We will add this discussion to the revised manuscript.
[1] Lipman, et al., Flow Matching for Generative Modeling, ICLR 2023.
All reviewers agreed that this paper provides a solid contribution to the community by mapping the different FM guidance methods in prior work and proposing new and better methods. Most of the issues raised by the reviewers were addressed in the rebuttal. In contrast to training-free guidance methods, there were concerns about the effectiveness of the training-based guidance methods. The authors agreed to change the general term of "guidance" to "energy guidance" to clarify the scope of the paper.