Controlled Generation with Equivariant Variational Flow Matching
We propose Variational Flow Matching for controlled generation, unifying conditional training and post hoc Bayesian inference with an equivariant formulation for molecular generation.
Abstract
Reviews and Discussion
In this paper, the authors present a controlled generation objective in the framework of Variational Flow Matching (VFM), as well as an equivariant formulation of VFM which has applications in 3D molecule generation. For controlled generation, the authors demonstrate that both end-to-end constrained training and post-hoc modification of pretrained models are possible under their methodology. Results are demonstrated on several unconditional and conditional molecular generation tasks, including the QM9, GEOM-Drugs, and Zinc250k datasets.
Questions for Authors
How does the performance of the proposed post-hoc modification approach change as a function of how out-of-distribution the target property is for the generative model and/or classifier?
Is classification always used as guidance even if the target is continuous-valued (e.g., by binning into discrete classes)? Or is the approach equally valid for regression as well?
Claims and Evidence
- In Table 2, the authors present results on continuous molecular generation with G-VFM as “our variational treatment of flow matching”. However, as I understand it, this Unconditional Generation section simply reproduces/re-implements the previously described VFM framework on these specific datasets. The authors state this explicitly for Table 1 (discrete generation), but not for Tables 2 and 3. Either the specific contributions of this work for unconditional generation should be made clearer, or this section should be moved to the appendix, because it is not in service to the story/main contributions of the paper, which I interpreted to be controlled generation and equivariance.
- D-Flow seems to considerably outperform the proposed G-VFM methods for controlled generation in Table 4. The authors claim that this comes at the expense of increased generation cost for D-Flow, but I did not see a specific quantification of this cost (I may have missed it). NFE should be added to Table 4, unless I am missing something. Without this information, it is hard to judge whether the substantially worse results of G-VFM relative to D-Flow are justified by computational acceleration.
Methods and Evaluation Criteria
- Given the exclusive focus on molecule generation and a methodological section devoted to proving equivariance properties of VFM, I did not find any investigation/mention of equivariance in the experiments/evaluation (for instance, a direct ablation/comparison between VFM with the equivariance properties satisfied vs. not satisfied). This raises the question: if you can enforce equivariant/invariant generative dynamics, so what? Is it actually practically useful for, e.g., producing more physically realistic samples and/or achieving better conditioning scores? While the authors perhaps treated this as self-evident, there is growing evidence in adjacent fields that equivariance can be learned from data without explicit enforcement, and that this is actually the preferred approach as models/data scale [1]. I would like to see a more thorough investigation of this.
- Related to #1, given the lack of focus on equivariance in the results, I would also expect much broader evaluation/demonstration of the proposed method on modalities in which equivariance is not required, such as controlled image generation (with target properties like compressibility, prompt-image alignment, aesthetic quality, etc.); see [2] for details. Without this, I feel that the current scope of evaluation of the paper is too narrow to be of broad interest.
- Given the explicit mention of alternative conditioning strategies like classifier guidance (CG), I would have expected a direct comparison to CG and CFG to empirically demonstrate the benefits.
[1] Qu, Eric, and Aditi Krishnapriyan. "The importance of being scalable: Improving the speed and accuracy of neural network interatomic potentials across chemical domains." Advances in Neural Information Processing Systems 37 (2024): 139030-139053.
[2] Black, Kevin, et al. "Training diffusion models with reinforcement learning." arXiv preprint arXiv:2305.13301 (2023).
Theoretical Claims
The primary theoretical claim is that given appropriate choices of the prior, conditional velocity field, and variational posterior, the generative dynamics of VFM are invariant under transformations in SO(n). I did not carefully review the proof. Since I am not very familiar with the literature in this area, can the authors comment on how different/surprising this result is compared to (what I assume are known equivariance properties) of regular FM?
Experimental Design and Analyses
See above.
Supplementary Material
I briefly reviewed the proofs in the SI.
Relation to Existing Literature
This work builds upon Variational Flow Matching, first introduced in [1].
[1] Eijkelboom, Floor, et al. "Variational flow matching for graph generation." Advances in Neural Information Processing Systems 37 (2024): 11735-11764.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
The paper is well written and easy to follow.
Other Comments or Suggestions
Include citations for baselines directly in the Tables.
Dear reviewer uXdp,
Thank you for the detailed and thoughtful review. We appreciate your close reading and thank you for your constructive suggestions. Below we address the main points raised in your review and clarify several aspects that were not sufficiently emphasized in the original submission.
1. Contribution in Unconditional Generation (Table 2)
We agree that the motivation behind the unconditional generation section was not clearly explained. Our goal with Table 2 was not to introduce a new architecture, but to show that strong performance can be achieved via a simple, unified formulation under the VFM framework. By linearly interpolating between noise and data—whether continuous or discrete—and selecting an appropriate variational distribution (e.g., Gaussian or categorical), one can train a single model with a shared objective across modalities. This avoids the need for specialized architectures or training regimes typically required in standard FM for mixed data types, and even generalizes to other distributions (e.g., Poisson for neural activity). We have clarified this conceptual simplicity in the revision and added citations for baselines directly in the tables.
2. Comparison to D-Flow and Efficiency Tradeoff
While D-Flow performs better in Table 4, it does so by optimizing over many candidate initializations, each requiring a full simulation of the forward process. In contrast, our method uses a single forward pass and a short fixed-point calibration for conditional guidance, requiring only standard forward evaluations and no additional gradient computations. We prioritize simplicity and scalability, whereas D-Flow trades compute for precision. Both are valid design choices, and we now highlight this tradeoff explicitly in the paper, including a discussion of guidance cost and number of function evaluations (NFE).
3. Equivariance Evaluation and Practical Utility
You are correct that we do not isolate the empirical effect of equivariance. While the theoretical formulation is a key contribution, we agree that deeper empirical validation would strengthen the work. We now clarify which models enforce which equivariant properties and will provide an ablation in the final version.
4. Generality Beyond Molecular Generation
While molecular generation is a natural testbed due to its symmetry constraints, the proposed methods are general. VFM provides a flexible framework for combining learned dynamics with structural inductive biases. We now emphasize this more clearly in the discussion and are actively exploring applications beyond molecules, including images (to be included in the final version, at least as a proof of concept).
5. Comparison to CG / CFG Methods
You raise a valid point. Our method assumes access to a classifier over clean endpoints, $p(y \mid x_1)$, rather than a time-dependent classifier $p(y \mid x_t)$. While related, these regimes differ in flexibility and cost. CG/CFG typically requires backpropagating through a denoised prediction at every timestep, which is expensive, and CFG additionally requires conditional training. Our fixed-point procedure operates post-hoc on the endpoint and does not require joint training. We now clarify this key distinction in the manuscript.
6. Theoretical Novelty of Equivariance Result
While standard flow matching methods can exhibit equivariant behavior under certain conditions, our contribution is to show that equivariance can be guaranteed through the variational distribution—without requiring the velocity field itself to be equivariant. This decoupling is novel and enables greater modularity: symmetry constraints can be enforced directly through the variational family, independent of the learned dynamics. This allows for clean, flexible design of symmetry-preserving generative models and simplifies implementation in practice. We now clarify this point more explicitly in the final version.
Once again, we thank the reviewer for their careful and constructive feedback. Your comments helped us significantly improve the clarity, presentation, and positioning of our contributions.
I thank the authors for their clarifications. I have a few more questions.
- I still don't understand the contribution of Table 2. What is the precise difference between "the simple, unified formulation of VFM" and the original VFM paper by Eijkelboom et al.? And I don't see how mixed modalities/data types factor in here. Is it the case that the original paper only considered discrete data, and you are now showing that it also works well on continuous data? If so, this should be made clearer in the presentation of Table 2.
- Is Eq. 15 (fixed-point refinement) applied after every step of the flow matching reverse process, or only at the end, after the unconditional samples have been generated? If the former, how is it justified to use a classifier trained only on samples from the data manifold and not on noised samples? And if the latter, how does the gradient ascent procedure avoid producing degenerate samples that maximize $p(y \mid x_1)$ but have low likelihood under the unconditional data distribution? This was not clear to me, and I would suggest including an algorithm for controlled generation as inference to make this clearer, as this is one of the main contributions of the paper.
Also, I still find the absence of empirical results on equivariance to be a significant drawback of the paper, given the attention devoted to it in the theoretical part of the paper.
I would be willing to reconsider my score if these points are addressed.
Thank you again for your thoughtful engagement and helpful questions. We are happy to clarify the remaining concerns.
1. Contribution of Table 2 and mixed modality handling
The original VFM paper by Eijkelboom et al. indeed only evaluated discrete graph generation, and to the best of our knowledge, continuous data or mixed discrete-continuous settings have not been tested anywhere in the literature. As such, the contribution of these experiments is twofold:
- We provide the first empirical evaluation of VFM in the continuous setting.
- More importantly, we demonstrate that the same objective can be used across both discrete and continuous modalities.
This is particularly attractive for molecular generation, which often involves mixed data types (i.e. atom types, positions, and formal charges). Our results show that VFM offers a unified framework that flexibly adapts to these diverse generative tasks without the need to redesign losses or architectures.
In short, we believe the utility of VFM was underexplored, and we saw a significant gap, especially in the conditional setting, which we address in this work.
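For concreteness, here is a minimal sketch of such a shared objective (our illustration only; the model interface, the one-hot treatment of types, and the uniform "noise" for discrete features are assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def vfm_mixed_loss(model, pos1, types1, num_types):
    """One training step of a shared VFM objective over mixed data.

    pos1:   (B, N, 3) target atom positions (continuous endpoint x1).
    types1: (B, N)    target atom types (discrete endpoint x1).

    The same forward-KL objective reduces to an MSE on the posterior
    mean for Gaussian variational distributions and a cross-entropy
    for categorical ones -- no per-modality loss design needed.
    """
    B = pos1.shape[0]
    t = torch.rand(B, 1, 1)                               # per-sample time

    # condOT path: linear interpolation between noise and data.
    pos0 = torch.randn_like(pos1)
    pos_t = (1 - t) * pos0 + t * pos1

    types1_oh = F.one_hot(types1, num_types).float()      # (B, N, K)
    types0 = torch.full_like(types1_oh, 1.0 / num_types)  # uniform "noise"
    types_t = (1 - t) * types0 + t * types1_oh

    # The model parameterizes q(x1 | xt): a Gaussian mean for positions
    # and categorical logits for atom types.
    mu_pos, type_logits = model(pos_t, types_t, t.view(-1))

    loss_pos = F.mse_loss(mu_pos, pos1)                   # Gaussian NLL (up to const.)
    loss_type = F.cross_entropy(type_logits.reshape(-1, num_types),
                                types1.reshape(-1))
    return loss_pos + loss_type
```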
2. Fixed-point refinement and classifier training
Yes, the fixed-point refinement (Eq. 15) is applied at every step of the reverse process. The reviewer raises an important point about the use of a classifier trained only on clean (data manifold) samples, rather than noised ones.
Our approach is based on the key observation that the posterior can be factorized as:
$$p_t(x_1 \mid x_t, y) \;\propto\; p_t(x_1 \mid x_t)\, p(y \mid x_1).$$
This allows us to separate inference into two parts: inference of $x_1$ given $x_t$, handled by VFM, and inference of $y$ from $x_1$, handled by a classifier.
This decoupling is powerful because it enables post-hoc conditioning using any classifier trained on clean data—removing the need to retrain on noisy samples as in standard classifier guidance for diffusion models. Furthermore, this opens up possibilities for using task-specific classifiers, such as those leveraging geometric or topological inductive biases [1,2]. In this way, our method introduces a new inference-time interpretation of conditional generation in generative modeling. We emphasize these points in the final version and will add an algorithm block for clarification.
In the case where one does not use a noisy classifier for guidance but instead approximates it with a 'standard' classifier (a form of CG that is closer to our approach), note that one then has to take gradients of the score function with respect to $x_t$. Our approach completely avoids this and only uses the classifier for evaluation. Hence, this work gives a genuinely different way of thinking about generation, which we believe to be fruitful based on our initial experiments.
3. Equivariance Ablations
Thank you for highlighting this point again. We agree that this was missing in our initial submission. We have now rerun the key experiments without equivariance constraints and observed a notable drop in performance. These results will be included as an ablation in the final version of the paper. This reinforces the utility of enforcing equivariance in our generative models—although, as recent work such as Vadgama et al. [3] shows, it is not universally required for good performance. We hope our results help nuance this ongoing discussion in the community.
[1] Zhdanov, M., Ruhe, D., Weiler, M., Lucic, A., Brandstetter, J., & Forré, P. (2024). Clifford-steerable convolutional neural networks. arXiv preprint arXiv:2402.14730.
[2] Liu, C., Ruhe, D., Eijkelboom, F., & Forré, P. (2024). Clifford group equivariant simplicial message passing networks. arXiv preprint arXiv:2402.10011.
[3] Vadgama, S., Islam, M. M., Buracus, D., Shewmake, C., & Bekkers, E. (2025). On the Utility of Equivariance and Symmetry Breaking in Deep Learning Architectures on Point Clouds. arXiv preprint arXiv:2501.01999.
The paper proposes two novel methods within the variational flow matching (VFM) framework for generative modeling. The first is controlled generation, which enables conditional generation using unconditional generative models without requiring retraining (though it is not necessarily limited to this scenario). The second method introduces equivariant generative models, which are well-suited for molecular generation tasks, where invariance to rotations, translations, and permutations is beneficial.
Background on flow matching and variational flow matching
In the standard flow matching (FM) framework, models directly parameterize the velocity field $u_t(x)$, i.e., the expected value of the endpoint-conditional velocity field $u_t(x \mid x_1)$ under the posterior distribution $p_t(x_1 \mid x)$ (i.e., the distribution of the endpoint given the current position).
In contrast, the variational flow matching (VFM) framework does not directly model the velocity field. Instead, it models the posterior distribution $q_t^\theta(x_1 \mid x)$ and takes the expectation later. Thus, the model is trained by minimizing the forward KL divergence between the ground-truth posterior and the model posterior. This training process simplifies into matching posterior means when the endpoint-conditional velocity field is linear, such as in conditional optimal transport (condOT). Consequently, a unimodal distribution, such as a Gaussian, can be used for continuous random variables.
Moreover, under the linearity assumption on the endpoint-conditional velocity field, computing the velocity field is further simplified—it can be done by parameterizing the mean of a Gaussian, for example.
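Concretely, under condOT the endpoint-conditional velocity is linear in the endpoint, $u_t(x \mid x_1) = (x_1 - x)/(1 - t)$, so by linearity of expectation the marginal velocity depends on the variational posterior only through its mean:
$$u_t^\theta(x) \;=\; \mathbb{E}_{q_t^\theta(x_1 \mid x)}\!\left[\frac{x_1 - x}{1 - t}\right] \;=\; \frac{\mu_t^\theta(x) - x}{1 - t}.$$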
Controlled generation
For controlled generation, the paper highlights a key observation: the endpoint-conditional velocity field should remain unchanged regardless of additional conditions, since the endpoint already contains information about those conditions. Leveraging this insight, the authors show that the velocity field of a conditional generative model corresponds to the expected value of the same endpoint-conditional velocity field as in the unconditional model. However, in the conditional case, the expectation is taken with respect to the conditional posterior of the endpoint (rather than the unconditional posterior), given both the additional condition and the current position.
The paper then shows that the conditional posterior of the endpoint can be rewritten in terms of the unconditional posterior and a classifier (Equation 13) using Bayes' theorem up to a normalizing constant. This formulation enables reusing the unconditional model without retraining. However, taking an expectation with respect to an unnormalized distribution is typically intractable, making direct computation infeasible.
To address this, the authors leverage the linearity assumption on the endpoint-conditional velocity field. Since VFM only requires a unimodal distribution, they propose estimating the mean of the conditional posterior by solving a fixed-point equation (Equation 14) iteratively. This iterative process enables controlled generation without sampling from unnormalized distributions.
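A minimal sketch of what this iteration can look like (our reading of the description; the Gaussian-posterior variance sigma2, the fixed step count, and all names are assumptions rather than the paper's exact Equation 15):

```python
import torch

def refine_posterior_mean(mu_uncond, log_classifier, sigma2=0.1, n_steps=5):
    """Fixed-point refinement of the conditional posterior mean.

    mu_uncond:      mean of the unconditional variational posterior q(x1 | xt).
    log_classifier: callable x1 -> log p(y | x1) for the target property y.
    sigma2:         assumed variance scale of the Gaussian posterior.

    Each iteration tilts the unconditional mean by the classifier gradient
    evaluated at the current iterate; under a Gaussian posterior this is a
    fixed-point iteration for the mode of q(x1 | xt) * p(y | x1). No
    gradients flow through the generative model itself.
    """
    mu = mu_uncond
    for _ in range(n_steps):
        x1 = mu.detach().requires_grad_(True)
        (grad,) = torch.autograd.grad(log_classifier(x1).sum(), x1)
        mu = mu_uncond + sigma2 * grad        # fixed-point update
    return mu.detach()
```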
Equivariant generative models
In addition to controlled generation, the paper introduces a method for equivariant generative modeling under the VFM framework. A key requirement for this approach is that the expected endpoint-conditional velocity field (computed over the posterior distribution of the endpoint given the current position) must be equivariant under group actions. The paper establishes that this can be achieved if:
- The prior distribution is group-invariant.
- The endpoint-conditional velocity field is bi-equivariant with respect to both the endpoint and the current position.
- The expected value of the model's posterior, which is a function of the current position, is group-equivariant with respect to its input (i.e., the current position).
The authors emphasize that when using a conditional optimal transport (condOT) map, the second condition is automatically satisfied if the relevant groups act linearly on the domains of interest, such as SO(n); this is the case for molecular generation, which is the main application of interest in the paper.
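As a worked one-liner, for the condOT field and a linear action of $g \in SO(n)$, bi-equivariance is immediate, and together with an equivariant posterior mean it transfers to the marginal velocity:
$$u_t(g x \mid g x_1) = \frac{g x_1 - g x}{1 - t} = g\, u_t(x \mid x_1), \qquad \mu_t^\theta(g x) = g\, \mu_t^\theta(x) \;\Longrightarrow\; u_t^\theta(g x) = \frac{\mu_t^\theta(g x) - g x}{1 - t} = g\, u_t^\theta(x).$$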
Experimental results
The paper demonstrates the effectiveness of the proposed methods on several benchmark molecular generation datasets, including QM9, ZINC250k, and GEOM.
Questions for Authors
N/A
Claims and Evidence
In my understanding, the paper's contributions are clear. The proposed methods--controlled generation and equivariant generative modeling under the variational flow matching (VFM) framework--are both conceptually insightful and practically impactful.
These contributions are particularly significant for several reasons. First, the controlled generation approach introduces an efficient mechanism for performing conditional generation without retraining (though it is not necessarily limited to this scenario, as discussed in the paper). The lack of an efficient controlled-generation method in the flow matching framework has led many practitioners across applications to prefer denoising diffusion. The proposed controlled generation is particularly valuable in scenarios where retraining is computationally expensive or infeasible.
Second, the introduction of equivariant generative models under the VFM framework is a notable theoretical contribution, as it ensures that the generative process respects fundamental symmetry constraints. This is crucial for applications such as molecular generation, where invariance to rotations, translations, and permutations is essential for generating physically meaningful structures.
Furthermore, the paper provides sufficient mathematical formulation and justification to support the proposed methods concisely, allowing readers to understand them easily. The experimental results on benchmark molecular datasets further validate the effectiveness of these approaches, demonstrating their potential real-world impact.
Overall, I find that the paper not only addresses a well-motivated problem but also offers valuable solutions that could inspire future research directions in both generative modeling and scientific machine learning.
Methods and Evaluation Criteria
N/A
Theoretical Claims
N/A
Experimental Design and Analyses
N/A
Supplementary Material
N/A
Relation to Existing Literature
N/A
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
Dear reviewer 6iCA,
We sincerely thank the reviewer for their comprehensive assessment of our paper.
We appreciate your recognition that these contributions are "conceptually insightful and practically impactful" and that our work addresses an important gap in flow matching frameworks. We are glad to hear the reviewer is on the same page about the interesting avenues to explore regarding integrating 'classical' inference techniques with SOTA generative modeling.
The core contributions of the paper are twofold.
[1. Inference-time control of VFM] The authors show that the conditional VFM posterior can be factorized into an unconditional VFM part and a classifier part. Based on this factorization, the authors propose an iterative approximation to the conditional posterior mean that can be computed at inference time.
[2.Achieving equivariance in VFM] The authors show that equivariance of VFM can be achieved by using an equivariant variational distribution, when the conditional vector field is equivariant.
Questions for Authors
At first glance, it is unclear how exactly $q(x_1 \mid x_t)$ is modeled. I presume that it is simply a mean-field Gaussian for continuous variables and a categorical distribution for discrete variables, as in the original VFM paper. Is this correct?
Claims and Evidence
1. Inference-time control of VFM
For the inference-time control, the authors propose to optimize the conditional mean $\mu$ by iteratively solving for the $\mu^*$ that satisfies the fixed-point equation (Equation 14). Since computing the score function exactly is intractable, the authors propose to approximate it using only the first moment $\mu$, resulting in Equation (15).
However, several strong assumptions are not sufficiently justified. Although the proposed approximate guidance appears to offer some empirical benefits, it lacks thorough theoretical or experimental validation beyond incremental metric improvement (e.g., adding the approximate guidance increases metric XX by YY%).
For example, the authors assume log-concavity of the classifier $p(y \mid x_1)$. However, most deep classifiers are not log-concave. Hence, as mentioned in the paper, convergence is not guaranteed, in contrast to typical classifier guidance for SDE-based models. Another strong assumption is the approximation of $p_t(x_1 \mid x_t)$ with a Gaussian centered at the mean of the variational distribution. Although the update rule derived from this approximation is simply VFM plus a classifier gradient, which seems reasonable, it is unclear whether this provides a good Bayesian approximation.
2. Achieving equivariance in VFM.
The authors show that VFM can be made equivariant by utilizing an equivariant variational distribution instead of directly imposing equivariance on the marginal vector field network. This is indeed a significant benefit of the VFM approach, as it allows one to flexibly place inductive biases in the variational distribution instead of the marginal vector field.
Methods and Evaluation Criteria
By utilizing the VFM framework, the authors show that it is possible to reduce the implementation complexity of an equivariant flow matching model for mixed discrete-continuous molecular generation tasks. In particular, the proposed method, G-VFM, matches the performance of SemlaFlow with a simpler training pipeline. The authors have also demonstrated the benefit of the proposed inference-time controlled generation.
Theoretical Claims
As mentioned earlier, the authors used several strong assumptions to derive Equation 15. This makes its Bayesian correctness questionable. However, the final update rule looks reasonable, regardless of its theoretical accuracy. Other proofs look correct.
Experimental Design and Analyses
As mentioned before, the paper could be improved with an in-depth qualitative/quantitative analysis on the accuracy of the Bayesian posterior approximation for inference-time controlled generation.
Supplementary Material
I have reviewed the proofs in Appendix A. Although I have not verified every claim with full mathematical rigor, they seem valid within engineering-level contexts.
Relation to Existing Literature
The flexibility of the proposed equivariant VFM approach could be beneficial for other areas such as 3D vision or robotics.
Essential References Not Discussed
Relevant works are appropriately referenced.
Other Strengths and Weaknesses
There are several unreferenced variables in the paper. For example, it is unclear where $\mu$ comes from. I presume this is the mean of the variational distribution $q(x_1 \mid x_t)$. The presentation of the paper could be improved if the authors explicitly state that the mean of $p(x_1 \mid x_t, y)$ is approximated with that of $q(x_1 \mid x_t)$. Also, it would help readers understand Equation 15 if the authors explained how it is related to VFM.
Other Comments or Suggestions
There is a typo in Table 3: the Mol Stab of G-VFM (99.5) is highlighted in bold, although SemlaFlow has a higher score of 99.6.
We thank the reviewer for the clear summary and thoughtful comments. We appreciate the recognition of our core contributions, as well as the constructive suggestions that helped us clarify and strengthen the presentation. Below, we address the reviewer’s main points.
1. Inference-Time Control: Assumptions and Approximation Accuracy
We agree that our formulation relies on simplifying assumptions (e.g., log-concavity of the classifier). These are now stated explicitly and discussed in the revised manuscript.
While VFM connected variational inference to unconditional flow matching, our goal is to extend this to conditional generation and introduce an equivariant formulation. We show that one can perform variational inference on the conditional posterior $p(x_1 \mid x_t, y)$, and more importantly, use this perspective to rethink conditioning in generative models. A simple proof-of-concept demonstrates this idea.
Our aim is not a full Bayesian treatment, but to show that minimal inference-time updates—without retraining and with negligible overhead—enable effective post-hoc control. Using only the posterior mean is sufficient: in endpoint-based models (e.g., flows, diffusions), the expected conditional velocity depends only on this value (see VFM). This eliminates the need for higher-order terms, keeping the method scalable while still improving empirical performance. As such, issues like log-concavity are less critical for our setting—we propose a simple integration of inference with generative modeling and focus not on recovering the full posterior, but on obtaining a reliable estimate of the posterior mean.
Broadly, this work opens a direction within VFM: enabling modular inference-time control by integrating classical inference tools with learned approximations. We clarified this vision in the revision, and hope it inspires further work in this efficient design space.
2. Clarification of Notation and Presentation
We thank the reviewer for catching unclear notation and formatting issues. We confirm that $\mu$ denotes the mean of $q(x_1 \mid x_t)$ and have made this explicit. We revised the explanation around Eq. 15 to clarify its relation to VFM: the fixed-point update approximates the mean of the conditional posterior $p(x_1 \mid x_t, y)$.
We also corrected the bolding in Table 3 and improved formatting throughout.
3. Modeling Details of $q(x_1 \mid x_t)$
Yes, we use a mean-field Gaussian for continuous variables and a categorical distribution for discrete ones, as in the original VFM. We added a brief clarification in the revised text.
Once again, we thank the reviewer for the insightful feedback, which helped refine both the clarity and precision of our work.
Dear authors, thank you for your clarification. My concerns have been addressed and I would like to keep my initial recommendation to accept the paper.
The paper focuses on extending the recently proposed Variational Flow Matching (NeurIPS 2024) approach for conditional generation and for incorporating inductive biases in the form of symmetries. The authors derive two different ways to perform controlled generation: the first is similar to conditional diffusion models, with the difference that here a conditional vector field is learned; the second resembles classifier guidance, where the conditioning is post-hoc. They then focus on the problem of generating molecules, where the true underlying distribution presents specific symmetries. They consider three different molecular generation tasks: discrete, continuous, and joint generation. In the first, they focus only on atom types and bond types; in the second, on atom positions and sometimes atom types; and in the third, on everything plus formal charges. Results show that the variational formulation of flow matching is beneficial.
Questions for Authors
I have just one question for the authors. If I am not completely wrong, it seems that Equation 13 can be written as:
$$p_t(x_1 \mid x_t, y) \;=\; \frac{p_t(x_1 \mid x_t)\, p(y \mid x_1)}{p(y \mid x_t)},$$
where in the denominator the time-dependent classifier $p(y \mid x_t)$ appears. Is it something that can be used to improve the guidance? It might be a completely useless question, but since I tried to derive Eq. 13 I was curious to ask whether it can be used.
Claims and Evidence
The main claim in the paper is that Variational Flow Matching is beneficial on top of Flow Matching, especially for conditional generation and when the true distribution presents symmetries. They consider mostly the problem of molecular generation, and the results show that the variational formulation is helpful.
The authors mention several times that "VFM provides a unified approach that can be applied to any combination of discrete and continuous molecular features", while I agree that the benefit is that the loss is the same, one still has to choose a different type of variational distribution. I found it a bit unfair with respect to diffusion models for example as, in that case, one has to choose a specific stochastic process instead of a different variational distribution. But this is a minor detail and the authors might also disagree with me.
Methods and Evaluation Criteria
Yes, the benchmark datasets are the usual ones used for evaluating generative models for molecular generation, and the metrics considered are also appropriate. I have a few comments about the tables that I leave in the experimental analysis section.
Theoretical Claims
They have two main propositions in the paper. The first shows that the vector field conditioned on some observations generates the conditional probability path. This was also derived in [1], but [1] did not provide a full proof. The second theoretical claim shows that the marginal path is invariant under G given the following elements: a prior that is invariant under G, a conditional velocity field that is bi-equivariant, and a variational posterior that is equivariant.
References
[1] Zheng, Q., Le, M., Shaul, N., Lipman, Y., Grover, A., & Chen, R. T. (2023). Guided flows for generative modeling and decision making. arXiv preprint arXiv:2311.13443.
Experimental Design and Analyses
The experimental design is valid. I just have a few comments regarding the analysis:
- The use of bold numbers is a bit misleading. For example, in Table 1 the authors bold two entries as tied even though they differ in the unique results column. In Table 2, EDM and EFM perform exactly the same in terms of atom stability and molecule stability, but their results are not bolded. In Table 3, SemlaFlow performs better on QM9 in terms of Mol. stability, but the result of G-VFM is bolded. Also, the EquiFM results on QM9 for JS(E) should be bolded, as they match SemlaFlow's.
- I think that the results in Table 3 do not present an apples-to-apples comparison, as some of the methods, like EDM, GCDM, and EquiFM, do not generate bond information, which has been shown to help achieve higher validity. Therefore, I invite the authors to make clear in the main text what the different models generate, e.g., whether they generate bonds and charges. That is the main reason for the huge difference in the metrics on QM9 in Table 3.
- I don't really see the point of having Table 1 in the main text, as these results and comparisons were already presented in the Variational Flow Matching paper.
- It would be nice if the authors presented the full details of model parameters and training in the appendix. The paper would also benefit if, for the results in Table 2, the authors specified whether atom types are modeled or not.
Supplementary Material
I went through all the sections of the supplementary material.
Relation to Existing Literature
The paper places itself in the flow matching landscape, building on top of the Variational Flow Matching approach [1]. The proposed techniques, specific to this approach, enable conditional generation similar to what is usually done when training conditional diffusion models [2] or performing post-hoc conditioning using classifier guidance [e.g., 3]. However, the method section does not cite relevant references throughout Sections 2 and 3, which I invite the authors to add. For example, the main part of Section 3.1 and the proposition were also derived in [4], which the authors do not cite. As the main application is molecular generation, I think the paper presents methods relevant to people working in that field. Although [5] does not evaluate log-likelihood, it is related work for Table 2.
References
[1] Eijkelboom, F., Bartosh, G., Andersson Naesseth, C., Welling, M., & van de Meent, J. W. (2024). Variational flow matching for graph generation. Advances in Neural Information Processing Systems, 37, 11735-11764.
[2] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840-6851.
[3] Chung, H., Kim, J., Mccann, M. T., Klasky, M. L., & Ye, J. C. (2022). Diffusion posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687.
[4] Zheng, Q., Le, M., Shaul, N., Lipman, Y., Grover, A., & Chen, R. T. (2023). Guided flows for generative modeling and decision making. arXiv preprint arXiv:2311.13443.
[5] Cornet, F., Bartosh, G., Schmidt, M., & Andersson Naesseth, C. (2024). Equivariant neural diffusion for molecule generation. Advances in Neural Information Processing Systems, 37, 49429-49460.
Essential References Not Discussed
See above
Other Strengths and Weaknesses
The paper is nicely written, and it tackles an important task in deep generative models, namely conditional generation and its application to modeling distributions with symmetries, which is of interest to the community. I would just like to add one more comment on something the authors mention in the paper on line 181 (right column): "unlike standard classifier-guidance methods in diffusion, which require a time-dependent classifier $p(y \mid x_t)$, classification in VFM is performed on data pairs $(x_1, y)$". I think this is not entirely correct, as in diffusion one can also use a classifier trained on clean samples, but at each denoising step one has to first apply Tweedie to get the approximate clean sample.
Other Comments or Suggestions
- Line 422: Fischer flow instead of Fisher flow
Dear reviewer TqUv,
Thank you for the detailed and thoughtful review and for engaging deeply with both the theoretical and empirical aspects of our work. Below we respond to the key points and how they will be addressed in the revised manuscript.
1. Unified Objective and Variational Distributions
We fully agree: while VFM provides a unified objective, it still requires a variational distribution per modality. However, our contribution is to show that these choices can be modular and plugged directly into the same objective, avoiding the need to redesign the loss or architecture per data type, a common challenge in standard FM pipelines. Though not fully explored here, this even naturally extends to other distributions (e.g., Poisson for neural data) by learning the conditional rate and optimizing the NLL. We clarified this core motivation in the revision.
2. Fairness of Table 3 Comparisons
While our method is flexible enough to generate any subset of molecular features (e.g., just positions or positions plus atom types), in all experiments, we matched the generation scope of each baseline for a fair comparison. For example, if a baseline did not model bond structures or formal charges, we excluded those as well. We now clearly state this in the text and annotate Table 3 to indicate which components each model generates.
3. Simplifying Assumptions and Classifier Guidance
We now explicitly state and discuss simplifying assumptions (e.g., classifier log-concavity) in the revised manuscript.
This work extends VFM with an inference-based view of conditional generation. We perform variational inference on $p(x_1 \mid x_t, y)$ in a simple proof-of-concept setup that yields meaningful improvements. Due to the linearity of the conditional velocity (as in flows/diffusions), matching only the posterior mean suffices (see VFM), and a full Bayesian treatment is unnecessary. Our main contribution is showing that this inference view enables scalable post-hoc control by combining classical inference tools with learned approximations. For more detail, we refer to our response to reviewer k6L2. This also addresses the reviewer's question about the normalization constant in Eq. 13—your interpretation is correct.
Furthermore, while related to classifier guidance (CG/CFG) in diffusion, our approach differs significantly in cost and flexibility. CG typically requires backpropagating through a denoised prediction at each timestep, which can be expensive, and CFG must be trained conditionally. In contrast, our fixed-point method operates post-hoc at the endpoint, with no joint training. Finally, even though it is true that a similar effect can be obtained via the Tweedie transform, the approach would arguably be rather noisy and therefore hard to learn. We now clarify this key distinction in the paper.
4. Author questions / Minor comments:
- Missing citations: We agree and will cite [4], [5], and other related work on conditional generation and classifier guidance. These citations have been added in Section 3.1.
- Missing Experimental Details (Table 2): We have included full model architecture and training details in the appendix, and clarified in Table 2 whether atom type modeling is included per experiment.
- Table Formatting and Metric Presentation: We corrected the inconsistent bolding and formatting in Tables 1–3. Bolding is now applied uniformly (best per column, including ties), and we added global method rankings to aid interpretability across datasets.
- Table 1 Repetition: We included Table 1 to demonstrate that G-VFM reproduces the results of standard VFM models as a consistency check. That said, we agree that this space could be better used to highlight new contributions, so we have moved Table 1 to the appendix.
- Typo: We have corrected “Fischer flow” to “Fisher flow” (line 422).
We again thank the reviewer for their constructive feedback. Your comments helped significantly improve the clarity, fairness, and completeness of the revised manuscript.
I would like to thank the authors for answering my questions. I would like to get these additional points clarified:
1- In Table 3 (QM9 experiment), EDM, GCDM, and EquiFM do not generate bond information, while SemlaFlow does. Therefore, I am a bit confused by the answer you gave me. Modeling the bond information usually helps in generating more stable molecules, as can be seen from the gap between EDM, GCDM, EquiFM, and SemlaFlow itself. Looking at the results of G-VFM, it seems that bonds are modeled in that case. Therefore, I really think this should be made clear in the text or table caption.
2- I also need some more clarification on this point: "CG typically requires backpropagating through a denoised prediction at each timestep, which can be expensive and must be trained conditionally. In contrast, **our fixed-point method operates post-hoc at the endpoint, with no joint training**." I think it is related to what Reviewer uXdp is also asking. How many refinement steps do you usually perform?
Also, in diffusion, to perform classifier guidance, you don't need joint training of the diffusion model and the classifier. In the case of a time-dependent classifier $p(y \mid x_t)$, one needs the noising schedule of the diffusion model, but one can also just use a pre-trained classifier $p(y \mid \hat{x}_1)$ (where $\hat{x}_1$ is the cleaned sample) and then at each reverse step get $\hat{x}_1$ by Tweedie, which is an $\mathcal{O}(1)$ operation. Maybe a pseudo-algorithm explaining the proposed method might help.
Edit after reading authors' answers to my comments
I really think that incorporating appropriate citations in the method section, having a fairer discussion of results, and making the overall proposed method clearer with a pseudo-algorithm can make the paper much stronger. I believe that you are going to make these changes in the updated version of the paper. Therefore, I will increase my score.
Thank you for your follow-up and for the clear articulation of your concerns.
1. Clarification on bond modeling in Table 3
Thank you for pointing this out—we agree this distinction was not sufficiently clear in the original submission. We initially followed the evaluation setup from SemlaFlow, where models that do and do not jointly generate all molecular attributes (including bonds) are compared within the same table. We believed that fully modeling all attributes jointly, as we do, constituted the more challenging setting. That said, we now make this distinction explicit and will revise Table 3 and the surrounding discussion accordingly.
2. Clarification on classifier guidance vs. our fixed-point method
Indeed, as you and Reviewer uXdp correctly note, classifier guidance in diffusion models does not necessarily require joint training. One can use a classifier trained on clean data and approximate the noisy conditional $p(y \mid x_t)$ using, e.g., Tweedie denoising. However, this still involves computing gradients through the score function with respect to the inputs $x_t$, since the clean prediction $\hat{x}_1$ is a function of the noised sample.
Our method differs in that it operates entirely at the endpoint level. The classifier is only evaluated directly on $x_1$, which removes the need for backpropagation through the learned dynamics entirely.
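Schematically (notation ours, with $\hat{x}_1(x_t)$ denoting the Tweedie estimate):
$$\underbrace{\nabla_{x_t} \log p\big(y \mid \hat{x}_1(x_t)\big)}_{\text{CG with a clean classifier: backprop through } \hat{x}_1(\cdot)} \qquad \text{vs.} \qquad \underbrace{\nabla_{x_1} \log p(y \mid x_1)\big|_{x_1 = \mu}}_{\text{ours: classifier gradient only, no flow backprop}}$$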
This makes the method lightweight and modular—allowing us to plug in any classifier, even ones trained on structured, topological, or symmetry-aware features, without retraining or modifying the generative model. We will include a pseudo-algorithm in the final version and clarify that we typically use 3–10 refinement steps in practice.
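Ahead of the promised pseudo-algorithm, a hypothetical end-to-end sketch consistent with the description above (Euler integration; refine_posterior_mean refers to the sketch given earlier in this thread, and the model interface is an assumption):

```python
import torch

def guided_sample(model, log_classifier, shape,
                  n_ode_steps=100, n_refine=5, sigma2=0.1):
    """Post-hoc guided sampling: refine the posterior mean at every step.

    model: callable (x_t, t) -> mean of q(x1 | xt) from an
           unconditionally trained VFM model.
    """
    x = torch.randn(shape)                    # sample from the prior
    dt = 1.0 / n_ode_steps
    for i in range(n_ode_steps):
        t = i * dt
        with torch.no_grad():
            mu = model(x, t)                  # unconditional posterior mean
        mu = refine_posterior_mean(mu, log_classifier,
                                   sigma2=sigma2, n_steps=n_refine)
        v = (mu - x) / (1.0 - t)              # condOT marginal velocity
        x = x + dt * v                        # Euler step
    return x
```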
More broadly, we believe our approach offers a novel perspective on conditioning in generative modeling—reframing controlled generation as a form of inference rather than training. Our initial results suggest that this perspective is not only conceptually valid but also empirically effective, and we believe it opens the door to new controlled-generation techniques.
This paper extends Variational Flow Matching (VFM) for conditional generation and introduces methods for inference-time control and symmetry-aware modeling.
Pros:
- Introduces an inference-time controllable approach for conditional generation using VFM, offering flexibility similar to classifier guidance in diffusion models.
- Provides a method to achieve equivariance in generative models through variational distribution design.
Cons:
- Relies on strong assumptions (e.g., log-concavity and Gaussian approximation) that lack rigorous theoretical or experimental justification.
- Experimental comparisons are sometimes unclear due to differences in generated features across models.
- Some notational and presentation issues (e.g., undefined variables and unclear derivation of equations) reduce clarity and accessibility.