Path Gradients after Flow Matching
We investigate the benefits of using path gradients to fine-tune CNFs initially trained by Flow Matching, in the setting where a target energy is known.
Summary
Reviews and Discussion
The paper presents a method for fine-tuning trained CNFs by using gradient information at the target samples in the context of learning Boltzmann generators (with access to samples). The method is shown to improve KL, ESS, and energy based metrics when compared to further FM training with the same computational cost, despite incurring significantly greater computational cost per training step.
Strengths and Weaknesses
Strengths
The paper's principal motivation is that the training of Boltzmann generators should effectively leverage both energy and samples, a point which should be commended and pursued in the literature. The experiments indeed show across-the-board improvements on top of CNFs from prior works as a result of the energy fine-tuning procedure pursued here. The experiments are done with proper training-time controls, and there is careful accounting and explanation of the NFE cost of the method.
Most serious concerns
To this reviewer, familiar with CNFs and Boltzmann generators but not the path gradient method, it is not immediately clear how the path gradient loss here differs from the standard maximum likelihood loss for CNFs. The two cited path gradient papers consider the problem of minimizing reverse KLs, as written in Eqs. 13, 14. Instead, the authors recast likelihood maximization into a reverse KL by treating the pullback of the target samples through the flow as the parameterized approximation of the fixed prior. This is a significant enough departure that a substantial discussion of the properties of this new objective is merited. It leads to some clunky notation (which should have a standalone short explanation), but more importantly, since it appears identical to an objective that does not use gradient information, it is not clear where the target gradient information comes from. At the very least, the flipped variant of Eq. 14 actually optimized in practice should be written out and dissected. My best guess is that the term inside the expectation is somehow not equal to the plain likelihood term but differs by a control variate that contains the target gradient. If the authors can provide a satisfactory explanation in the rebuttal (and promise to include it in the supplement for publication), I am open to raising the score.
Other concerns
- Why is there an x axis for Fig 2 but not Fig 1?
- Please label figure captions with the task / system being sampled.
- L206 memory -> compute
- The difference between the two evaluation settings is not very clear. Isn't "applying path gradients at the end" the same as fine-tuning?
- The text should state the difference between the models labeled Klein et al 2023b and Klein & Noe 2024.
- L226 Standard, OT, and EQ OT FM are all mentioned, but only one row per system is shown in Table 1. Please clarify. Is L226 referencing only the 2nd set of experiments?
- "We expect path gradients to become SOTA for BGs and generative models in science" - this statement is far too strong.
Questions
Actionable items are discussed in Strengths And Weaknesses.
Limitations
Yes.
Final Justification
The authors have addressed my primary concern regarding clarity and have committed to including detailed explanations and background in the revision.
Formatting Concerns
The authors seem to have manipulated the template to reduce spacing between paragraphs.
We thank the reviewer for the valuable input. We fully agree with every point raised. We glossed over the actual gradient estimators, which is something we will add to the manuscript.
And yes, your guess is correct. The maximum likelihood (ML) and PG estimators minimize the same loss and only differ by a control variate term. The PG estimator for the forward KL was in fact introduced and discussed in Chapter 4 of [1].
In the following, we first discuss the ML and PG estimators for the forward KL and then show benefits of PG over FM for a toy example. We hope you agree that this additional discussion is a valuable addition to the paper.
The Maximum Likelihood and Path Gradient Estimators.
In short: both the ML and PG estimators are unbiased and consistent, but they differ in variance. For PG, we can give guarantees about the variance once the model $q_\theta$ equals the target $p$; see e.g. [2,3,4,5].
Both the Maximum Likelihood and the Path Gradient estimators optimize the forward KL divergence, Equation (8). Let us look at both estimators in detail and compare them; a full derivation can be found in Appendix B.3.2 of [1] and Chapter 4 of [2].

To obtain the ML estimator $\mathcal G_{ML}$, we first observe that the first term in the KL divergence, the entropy of the target, is constant w.r.t. $\theta$, which means that it does not enter the gradient if we directly estimate the gradient via an MC estimator. The ML estimator is then the score of the model evaluated at the target samples:

$$\mathcal G_{ML} = -\frac{1}{N}\sum_{i=1}^{N} \frac{\partial}{\partial \theta} \log q_\theta\big(x_1^{(i)}\big), \qquad x_1^{(i)} \sim p. \tag{*}$$

For PG, we use Eq. (17) to directly obtain the MC estimator

$$\mathcal G_{PG} = \frac{1}{N}\sum_{i=1}^{N} \Big[\nabla_x \log q_\theta(x) - \nabla_x \log p(x)\Big]^\top_{x = x_1^{(i)}} \frac{\partial x_1^{(i)}}{\partial \theta}, \tag{**}$$

where $x_1^{(i)} = f_\theta(\bar x_0^{(i)})$ is a shorthand for the target sample mapped back through the frozen inverse flow and pushed forward again, i.e. $\bar x_0^{(i)} = \mathrm{stopgrad}\big(f_\theta^{-1}(x_1^{(i)})\big)$.

Comparing (*) and (**), we see that the path gradient estimator incorporates the gradient information $\nabla_x \log p$ of the target, while $\mathcal G_{ML}$ - and by construction also $\mathcal G_{FM}$ - does not.
Variance of the estimators:
As opposed to $\mathcal G_{ML}$ and $\mathcal G_{FM}$, we have nice guarantees about the variance of the PG estimator at the optimum and close to it.
We first recapitulate results from previous work and then, using a simple example, show that $\mathcal G_{FM}$ does not necessarily exhibit zero variance at the optimum.
In general, the variance of the path gradient estimators is bounded by the squared Lipschitz constant of the difference $\log q_\theta - \log p$ ([6], Section 5.3.2). Thus, if the target density $p$ is not well approximated by $q_\theta$, the variance of the gradient estimator can be large and training with PG might not be beneficial. If $q_\theta$ and $p$ are close, in the sense that the Lipschitz constant of their difference is small, we can expect path gradient estimators to be helpful.
Gradient estimators at the optimum
In the case of perfect approximation, i.e. $q_\theta = p$, the following statements about the ML and path gradient estimators are known. The path gradient estimator is deterministically zero, i.e. $\mathrm{Var}[\mathcal G_{PG}] = 0$, while for ML the variance is generically nonzero, $\mathrm{Var}[\mathcal G_{ML}] = \frac{1}{N} F(\theta)$, where $F(\theta)$ is the Fisher Information Matrix of $q_\theta$ [2].
Why is that?
Already the review [7] notes in Section 2.3.3 that, by the duality of the KL divergence, fitting the model $q_\theta$ to the target $p$ using ML is equivalent to fitting the pullback of $p$ through the flow to the base distribution under the reverse KL. This means that the results from [2] directly apply; we only adapted the notation.
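To make the zero-variance property concrete, here is a minimal 1D illustration of our own (not the paper's CNF estimator): an affine flow $x = e^s z + m$ with base and target both $\mathcal N(0,1)$, comparing the per-sample ML and path gradients w.r.t. $m$ at the optimum $(s, m) = (0, 0)$. The gradient formulas below are derived by hand for this toy model.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000

# Toy affine flow x = exp(s) * z + m; base and target are both N(0, 1),
# so (s, m) = (0, 0) is the optimum and q_theta equals p exactly.
s, m = 0.0, 0.0
x = rng.standard_normal(N)                 # samples from the target p

# Per-sample ML gradient w.r.t. m of -log q_theta(x): minus the model score.
g_ml = -(x - m) / np.exp(2 * s)            # mean 0, variance = Fisher info = 1

# Per-sample path gradient w.r.t. m:
# (d/dx log q_theta - d/dx log p) * dx/dm, with dx/dm = 1 for this flow.
score_q = (m - x) / np.exp(2 * s)
score_p = -x
g_pg = (score_q - score_p) * 1.0           # identically 0 at the optimum

print(np.var(g_ml))   # close to 1, the Fisher information
print(np.var(g_pg))   # exactly 0
```

The ML gradient keeps fluctuating with variance given by the Fisher information even at the perfect fit, while the path gradient vanishes sample-by-sample because the two scores cancel.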
Variance of $\mathcal G_{FM}$ for a toy example
We do not have an expression for the variance of $\mathcal G_{FM}$ in general settings, but we can show via a simple example that it does not deterministically vanish at the optimum like $\mathcal G_{PG}$ does. This means better behavior for PG than for FM at the optimum, which we also verified empirically.
Our assumptions aim to simplify the example as much as possible.
- First, assume the standard loss for Flow Matching.
- Further, assume the base and target densities to be the same $D$-dimensional Normal distribution $\mathcal N(0, I_D)$.
- Finally, assume the CNF velocity field is a constant parametrized by a single scalar $\theta$, i.e. $v_\theta^d = \theta$ for every dimension $d$, so that the optimum $\theta = 0$ gives the identity flow.
In this example, $q_\theta$ approximates the target density perfectly at the optimum and the expected gradient is $0$. Yet the variance of the estimator is non-zero, $\mathrm{Var}[\mathcal G_{FM}] = \frac{8}{ND}$, preventing the model from staying at the optimum during training.
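The non-vanishing variance derived in the proof below, $\mathrm{Var}[\mathcal G_{FM}] = 8/(ND)$, can also be checked with a quick Monte Carlo simulation of our own (numpy; the values of N, D, and the number of trials are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, trials = 64, 13, 5_000

# At the optimum v_theta = 0, the FM gradient estimator reduces to
#   G_FM = (2 / (N * D)) * sum_{i,d} (x0_tilde - x1)
x0 = rng.standard_normal((trials, N, D))   # tilde x_0 ~ N(0, I_D)
x1 = rng.standard_normal((trials, N, D))   # x_1 ~ N(0, I_D), independent
g_fm = 2.0 / (N * D) * (x0 - x1).sum(axis=(1, 2))

print(g_fm.mean())     # close to 0: the estimator is unbiased
print(np.var(g_fm))    # close to the predicted 8 / (N * D)
print(8.0 / (N * D))
```

The empirical variance matches the closed-form value, confirming that the FM estimator keeps a nonzero gradient noise floor at the optimum.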
Proof:
The gradient estimator for the FM loss, Eq. (10), is $\mathcal G_{FM} = \frac{\partial}{\partial \theta} \frac{1}{N}\sum_{i=1}^{N} \frac{1}{D}\big\| v_\theta + \tilde x_0^{(i)} - x_1^{(i)} \big\|^2$, where $\tilde x_0^{(i)} \sim \mathcal N(0, I_D)$. (Note that here $\tilde x_0^{(i)}$ and $x_1^{(i)}$ are independently sampled; before, $\bar x_0^{(i)}$ was the transformed sample.)
First, we break down the terms
$$\mathcal G_{FM}= \frac{1}{N} \sum_{i=1}^{N} \frac{\partial }{\partial \theta} \frac{1}{D} \sum_{d=1}^{D}\left(v_\theta^d +\tilde x_0^{(i),d} - x_1^{(i),d} \right)^2 = \frac{1}{ND} \sum_{i=1}^N \sum_{d=1}^D 2 \left(v_{\theta}^d + \tilde x_0^{(i),d} - x_1^{(i),d}\right) \frac{\partial v_{\theta}^d}{\partial \theta}.$$

Because $v_\theta$ is parametrized by the single parameter $\theta$, we have $\frac{\partial v_\theta^d}{\partial \theta}=1$; evaluating at the optimum $v_\theta^d=0$ and separating the terms gives

$$\mathcal G_{FM}= \frac {1} {ND} \sum_i \sum_d 2 \left( \tilde x_{0}^{(i),d} - x_{1}^{(i),d}\right) = \frac 2 {ND} \left(\sum_d \sum_i \tilde x_{0}^{(i),d} - \sum_d \sum_i x_{1}^{(i),d}\right).$$

We can compute the distribution of $\mathcal G_{FM}$ by using the property $\sum_{i=1}^N a_i \sim \mathcal N(0, N \sigma_a^2)$ for i.i.d. $a_i \sim \mathcal N(0, \sigma_a^2)$. So $\sum_d \sum_i \tilde x_{0}^{(i),d} \sim \mathcal N(0, DN)$ and $\sum_d \sum_i \tilde x_{0}^{(i),d} - \sum_d \sum_i x_{1}^{(i),d} \sim \mathcal N(0, 2 D N)$. The expectation of the estimator $\mathcal G_{FM}$ is therefore $0$ and the variance is

$$\mathrm{Var}[\mathcal G_{FM}] = \frac{4}{N^2D^2} \, \mathrm{Var}\Big[\sum_d \sum_i \tilde x_{0}^{(i),d} - \sum_d\sum_i x_{1}^{(i),d}\Big] = \frac{4}{N^2D^2} \cdot 2DN = \frac{8}{ND}.$$

Interestingly, the derivative $\mathcal G_{FM}$ is invariant to re-ordering because the sums over $x_1$ and $\tilde x_0$ are invariant to permutation. The result thus also holds for OT-FM.

[1] L. Vaitl et al., "Fast and unified path gradient estimators for normalizing flows", ICLR 2024
[2] L. Vaitl, "Path Gradient Estimators for Normalizing Flows", PhD Thesis
[3] G. Roeder et al., "Sticking the landing: Simple, lower-variance gradient estimators for variational inference", NeurIPS 2017
[4] G. Tucker et al., "Doubly reparameterized gradient estimators for Monte Carlo objectives", ICLR 2019
[5] L. Vaitl et al., "Path-gradient estimators for continuous normalizing flows", ICML 2022
[6] S. Mohamed et al., "Monte Carlo gradient estimation in machine learning", JMLR 2020
[7] G. Papamakarios et al., "Normalizing flows for probabilistic modeling and inference", JMLR 2021

Other concerns
- X axis, figure captions, L206, too-strong statement, L226 second set of experiments: Thanks for the input. We will fix these in the final manuscript.
- Isn't "applying path gradients at the end" the same as fine-tuning? True, the difference is not very clear in the paper. In the first setting, we used limited memory, compute, and time constraints to compare FM and the hybrid approach. This is to show that the improvements do not only come from additional training. The second setting does not limit computational resources or time, and training with PG starts after full training as done in Klein et al. 2023b. This was done to see how much PG can help training. We will clarify this in the final manuscript.
- Difference between the models labeled Klein et al. 2023b and Klein & Noe 2024: We agree with the reviewer that the difference should have been explained better in the paper. The architecture in Klein & Noe 2024 is mostly the same as in Klein et al. 2023b, the main difference being the encoding of the atoms. While the model in Klein et al. 2023b treats atoms of the same type as indistinguishable, the model in Klein & Noe 2024 encodes nearly all atoms differently; only hydrogens bound to the same atom are treated as indistinguishable. We will add a description in the final version.
- Paper formatting: We have not changed the template.

I appreciate the detailed explanation provided by the authors. I would request that a similar explanation, and a self-contained exposition of the background work (which is quite technical and not obvious to apply to new settings), be provided in the revision. If so, I am happy to raise the score.
We want to stress that we provided a detailed explanation for the different estimators, as the reviewer requested. Furthermore, we would like to thank the reviewer for suggesting it, as we are convinced it improves the paper.
While we'll gladly expand the section on path gradients to include more background -- in addition to the changes promised to reviewer P3GD -- providing a fully self-contained exposition of the background work is infeasible in this time frame. Furthermore, Vaitl 2024 [2] already provides exactly that self-contained background, which we will highlight more in the final manuscript. This 100+ page thesis offers a comprehensive overview of the theory, estimators, objectives, algorithms, and experiments related to path gradients for normalizing flows.
To the best of our knowledge and apart from Agrawal et al. 2025 [8], our work is the first paper to focus on path gradients since Vaitl 2024 [2], so the information is up to date. We hope this context justifies our reliance on that prior work as a detailed reference.
[8] Agrawal et al. "Disentangling impact of capacity, objective, batchsize, estimators, and step-size on flow VI" AISTATS 2025
In this paper, the authors propose using Flow Matching for pre-training and Path-Gradient during fine-tuning, in the case where the target energy is well-defined. The authors show that the proposed hybrid approach can significantly increase the sampling efficiency given the same computational budget on many particle systems. They further show that fine-tuning with Path Gradients does not affect the properties of the base model, while simultaneously improving the performance.
Strengths and Weaknesses
Strengths:
- The paper is well-structured, clearly written, and easy to follow.
- The proposed hybrid approach relying on Path Gradients shows an improvement in the performance and the sampling efficiency, using the same model and without relying on additional samples.
Weaknesses:
- The quantitative comparison in Table 1 is limited to comparing pre-trained OT-FM models with FM fine-tuning or with Path Gradient fine-tuning. Additional comparisons of a hybrid training using other base models would be helpful.
- An important baseline that is missing in the experiments is the method proposed in Satorras et al. [1].
- The authors suggest that Path Gradients are suitable for fine-tuning only when the flow is already close enough to the target distribution. However, this claim is not supported in the experimental setup. How does the final model perform when fine-tuning with Path Gradient is performed when the flow is not close to the target?
- One notable weakness of the proposed approach (already addressed in the limitations section) is that it relies on the availability of a well-defined and differentiable energy function over the data domain, restricting its applicability to other domains.
[1] Garcia Satorras, Victor, et al. "E (n) equivariant normalizing flows." Advances in Neural Information Processing Systems 34 (2021): 4181-4192.
Questions
- What is the performance of the hybrid approach using an increased runtime?
- Why do the values of ESS-Q and NLL in Figure 3 of the OT + Path Gradient fine-tuning baseline on LJ13 not match the results reported in Table 1?
- What is the runtime of the hybrid approach fine-tuning (Table 1)?
- While the authors suggest that they show how to compute Path Gradients with constant memory and without relying on additional samples, this was already addressed in prior work [2]. Can you clarify the novelty relative to Vaitl et al. [2]?
[2] Vaitl, Lorenz, et al. "Path-gradient estimators for continuous normalizing flows." International conference on machine learning. PMLR, 2022.
Limitations
Yes.
Final Justification
The authors have addressed my concerns. Many thanks!
Formatting Concerns
There are no major formatting issues
We thank the reviewer for their thoughtful comments and for highlighting the clarity and structure of the paper, as well as the practical advantages of our hybrid approach. We hope that our responses to the raised concerns help clarify our contributions and demonstrate the significance and robustness of the method.
Weaknesses:
"The quantitative comparison in Table 1 is limited to comparing pre-trained OT-FM models with FM fine-tuning or with Path Gradient fine-tuning. Additional comparisons of a hybrid training using other base models would be helpful."
We agree with the reviewer that additional comparisons would be helpful. So for the rebuttal we did additional experiments.
We evaluated our hybrid approach on the Dipeptide setting from Transferable Boltzmann generators [3]. This training set consists of 200 dipeptides and the evaluation is done on unseen dipeptides. Our approach yielded an improvement in the NLL on all evaluated molecules with an average improvement of 0.20, which is in line with our experiments on AD2. For additional information see our rebuttal for reviewer Fj6i.
Still, we want to note that in Section 4.2 we evaluated fine-tuning with EQ OT FM and Standard FM on LJ13 in addition to the comparisons in Table 1. For the experiments of Table 1 we stuck to OT FM because the ESS and NLL were similar for the three FM variants and Klein & Noe 2024 used OT FM.
The ability to optimize the flow trajectory length, as in OT-FM and EQOT-FM, is unique to FM approaches. These recent developments are of growing interest in the community. Our work seeks to understand their behavior under hybrid training. Other base models, like other CNFs or Coupling Type Flows were already evaluated in previous work [2,3,4] and showed improved performance with Path Gradients.
A hybrid approach applied to diffusion models and their Probability Flow ODE is a very promising direction for future work, but merits a thorough investigation.
"An important baseline that is missing in the experiments is the method proposed in Satorras et al. [1]."
We agree with the reviewer that the model in [1] is a good baseline and for the LJ13 we will gladly incorporate the baseline.
| Method | | |
|---|---|---|
| Satorras et al. [1] | | |
| Only FM [6] | | |
| Hybrid approach (Ours) | | |

Note that the model architectures for [1] and [6] are mostly the same. Klein et al. 2023 [6] already outperforms the model from [1] by using FM training instead of Maximum Likelihood (ML) training.
For the more challenging datasets, we therefore only compared the models proposed in Klein et al. 2023 [6] and Klein et al. 2024 [7], which outperform Satorras et al. [1].
"The authors suggest that Path Gradients are suitable for fine-tuning only when the flow is already close enough to the target distribution. However, this claim is not supported in the experimental setup. How does the final model perform when fine-tuning with Path Gradient is performed when the flow is not close to the target?"
We agree with the reviewer that Path Gradients (PG) can also improve training beyond the fine-tuning regime. Prior work (e.g., [2,3,4]) has demonstrated strong performance when training models exclusively with PG, outperforming other KL-based objectives (including Maximum Likelihood training).
Yet, our preference for a hybrid approach over exclusively PG is motivated by:
- Time efficiency: FM is significantly faster per step than PG and ML.
- Theoretical guarantees: PG has convergence guarantees only near the optimum (see Rebuttal to Reviewer DmNE for a thorough analysis).
- Numerical stability: Like ML training, PG involves Jacobians and their derivatives, which suffer from numerical instabilities, whereas FM is often more stable.
The time efficiency in particular was evident in our experiments:
- In our GMM experiment (Section 3), we directly compare PG, FM, and the hybrid approach. FM trains faster initially, but PG eventually surpasses FM in final performance. The hybrid strategy outperforms PG alone by combining FM’s rapid early progress with PG’s refinement.
- On the complex AD2-XTB (Table 1), we observe that, given limited time, FM outperforms fine-tuning with PG when using a less expressive model [3]. This supports our point: when the model is underfitting or far from the target distribution, PG fine-tuning provides limited benefit.
We will revise the manuscript to reflect this more nuanced perspective and avoid overstating the limitations of PG.
"One notable weakness of the proposed approach (already addressed in the limitations section) is that it relies on the availability of a well-defined and differentiable energy function over the data domain, restricting its applicability to other domains."
We agree with the reviewer. That said, the setting still covers a wide range of practically important domains in the natural sciences - including drug discovery, materials discovery, and protein folding (e.g., AlphaFold) - and can even be applied to model distillation, as used in Stable Diffusion 3.
[1] Garcia Satorras, Victor, et al. "E (n) equivariant normalizing flows." Advances in Neural Information Processing Systems 34 (2021): 4181-4192.
Questions:
"What is the performance of the hybrid approach using an increased runtime?"
For LJ13 we show the results with increased runtime in Table 3. For AD2-classical we give results for TBG [7] in Table 2. Summarizing, we were able to obtain the following results
| System | Time | Method | NLL | ESS_q |
|---|---|---|---|---|
| LJ13 | limited | OT FM + PG | -16.21 ± 0.00 | 82.97 ± 0.40 |
| LJ13 | unlimited | EQ OT FM + PG | -16.23 ± 0.00 | 87.77 ± 0.24 |
| AD2 classical | limited | OT FM + PG | -128.26 ± 0.02 | 24.39 ± 6.86 |
| AD2 classical | unlimited | OT FM + PG | -128.33 ± 0.04 | 29.47 ± 2.24 |
Additionally, we ran the hybrid approach on LJ55 once for 3 PG epochs instead of one and obtained an NLL of -89.38 and an ESS of 25.44%.
"Why do the values of ESS-Q and NLL in Figure 3 of the OT + Path Gradient fine-tuning baseline on LJ13 not match the results reported in Table 1?"
The settings are different:
- In Table 1, we used a pretrained model (1000 epochs FM with lr 5e-4), which we fine-tuned with PG under a limited batch size (64) and limited time.
- For the results in Figure 3, we used a model which was "fully trained" with FM (1000 epochs FM with lr 5e-4 and 1000 epochs with lr 5e-5) and fine-tuned it with PG and batch size 256. The larger batch size means that an epoch had fewer training steps.
"What is the runtime of the hybrid approach fine-tuning (Table 1)?"
We list the runtimes in A.4.
| | LJ13 | LJ55 | AD2 |
|---|---|---|---|
| Runtime | 252 min | 834 min | 331 min |

All runs were done on an A100.
"While the authors suggest that they show how to compute Path Gradients with constant memory and without relying on additional samples, this was already addressed in prior work [2]. Can you clarify the novelty to Vaitl et al. [2]?"
There are two major differences:
1. While Vaitl et al. [2] and our work both use PG, they minimize a different loss on different samples. Our work optimizes the forward KL with PG on existing samples from the target, as proposed in [3], using the algorithm proposed in [2]. Vaitl et al. [2] minimize the reverse KL divergence via self-sampling, i.e. on samples from the model. This has many drawbacks, mainly: modes of the target can be entirely missed, which invalidates all asymptotic guarantees and breaks importance sampling, see e.g. [5]. This becomes increasingly likely in higher dimensions.
2. Since the publication of [2], Flow Matching has been established as the de facto training method for CNFs. Our work investigates and combines the performance of Flow Matching and Path Gradients. Specifically, we investigate the effect of PG on the inner workings of the CNF, e.g. trajectory length, while [2] simply looked at the performance compared to standard self-sampling losses.

That being said, forward-KL-based training can straightforwardly be combined with self-sampling-based approaches (like [2]) to artificially increase the number of samples.
[2] Vaitl, Lorenz, et al. "Path-gradient estimators for continuous normalizing flows." International conference on machine learning. ICML, 2022.
[3] L. Vaitl et al., “Fast and unified path gradient estimators for normalizing flows”, ICLR 2024
[4] A. Agrawal et al., "Advances in black-box VI: Normalizing flows, importance weighting, and optimization." NeurIPS2020
[5] Detecting and Mitigating Mode-Collapse for Flow-based Sampling of Lattice Field Theories, Nicoli, Kim A. et al. PRD 23
[6] Klein, Leon et al. "Equivariant flow matching." NeurIPS (2023).
[7] Klein, Leon, and Frank Noé. "Transferable boltzmann generators." NeurIPS (2024)
Thank you for your answer and the clarification. I strongly recommend that the authors include the discussion of the novelty of the proposed approach to Vaitl et al. [1] in the manuscript.
I am increasing my score to 4.
[1] Vaitl, Lorenz, et al. "Path-gradient estimators for continuous normalizing flows." International conference on machine learning. ICML, 2022.
We want to thank the reviewer for their valuable review. We're glad we could convince them with our rebuttal. We will adapt the manuscript to clarify the novelty of our approach.
This paper proposes a hybrid training strategy for Continuous Normalizing Flows (CNFs) used as Boltzmann Generators for molecular systems. The authors identify that Flow Matching (FM) provides a computationally efficient way to pre-train a model using target samples, but it ignores valuable information from the target energy function. Conversely, training methods that use energy gradients, like Path Gradients (PG), are often computationally intensive. The proposed solution is to first pre-train a CNF with Flow Matching to rapidly obtain a good initial model, and then fine-tune this model using Path Gradients. This hybrid approach is shown to significantly improve the sampling efficiency of the resulting model—by up to a factor of three—within a similar computational budget as pure Flow Matching, all without requiring additional data. Furthermore, the work demonstrates that this fine-tuning process refines the model's accuracy while largely preserving the efficient, short integration paths learned during the Flow Matching phase.
Strengths and Weaknesses
Strengths
- The paper addresses the important and practical problem of how to best leverage all available information (both data samples and their energy gradients) for training Boltzmann Generators. The proposed hybrid approach is a simple, novel, and highly effective strategy that provides a clear recipe for improving upon existing state-of-the-art methods.
- The technical quality of the work is high. The claims are supported by strong and clear experimental evidence on several standard molecular dynamics benchmarks.
- The paper is very well-written and easy to follow. It provides a concise yet thorough background on the relevant concepts (CNFs, Flow Matching, Path Gradients) before clearly motivating and presenting the proposed hybrid method. The experiments are well-designed and the results are presented in a clear and compelling manner.
Weaknesses
- The experiments are performed on well-established but relatively small molecular systems (Lennard-Jones clusters and Alanine dipeptide). While these are excellent for controlled comparisons, demonstrating the method's effectiveness on a larger and more complex system (e.g., a larger protein) would significantly bolster the paper's impact.
Questions
- The hybrid strategy requires switching from Flow Matching pre-training to Path Gradient fine-tuning. Your work appears to do this after a fixed number of epochs. Is there a more principled way or a clear heuristic to determine the optimal time to make this switch? For instance, should one monitor the FM loss for a plateau? A generalizable strategy for this would increase the practical utility of your method.
- The performance gains are very impressive on the tested systems. How do you expect this hybrid approach to scale to much larger biomolecules? In such systems, force calculations are more expensive and the energy landscapes are far more complex. Do you hypothesize that the relative benefit of a few epochs of PG fine-tuning will increase or decrease in that regime?
Limitations
Yes. The authors provide a dedicated and candid "Limitations" section 5.1. They correctly identify the main constraints of the method: it requires a differentiable energy function, it is computationally more expensive per step than Flow Matching, and it performs best as a fine-tuner when the model is already a reasonable approximation of the target distribution. This is a transparent and sufficient discussion of the work's scope.
Final Justification
I had a full discussion with the author and I kept my score.
Formatting Concerns
No Paper Formatting Concerns.
We thank the reviewer for their thoughtful and encouraging feedback. We appreciate them pointing out the practical relevance of the problem, the strength of the experimental evidence, and the clarity of the presentation.
Weaknesses:
"The experiments are performed on well-established but relatively small molecular systems (Lennard-Jones clusters and Alanine dipeptide). While these are excellent for controlled comparisons, demonstrating the method's effectiveness on a larger and more complex system (e.g., a larger protein) would significantly bolster the paper's impact."
We agree that additional and more challenging systems improve the impact of the paper. So we additionally evaluated the performance of our approach on the Dipeptides dataset in (Klein & Noe, 2024) to test the transferability to unseen systems. This is currently the largest system Boltzmann Generators have shown transferability for.
Again, we fine-tune the available pretrained Boltzmann Generator from (Klein & Noe 2024) with path gradients. This transferable Boltzmann Generator is trained on a subset of all possible dipeptides and evaluated on unseen ones. Our experiments show that fine-tuning with path gradients improves the NLL and energies for all evaluated unseen test dipeptides and also improves the average ESS.
The full results can be found in our rebuttal to reviewer Fj6i.
Questions:
"The hybrid strategy requires switching from Flow Matching pre-training to Path Gradient fine-tuning. Your work appears to do this after a fixed number of epochs. Is there a more principled way or a clear heuristic to determine the optimal time to make this switch? For instance, should one monitor the FM loss for a plateau? A generalizable strategy for this would increase the practical utility of your method."
This is a good idea. In principle, monitoring the FM loss plateau is a reasonable heuristic and inexpensive to implement. However, our primary focus was understanding and comparing the behavior of PG, FM, and their combination—rather than on optimizing the switching criterion itself. For the experiments in Section 4.1, we simply used half the available training time for FM and PG each, for Section 4.2 we used the training regime in (Klein et al. 2023b) for pretraining with FM.
While designing a more principled or generalizable strategy is an interesting direction, its effectiveness can depend on the dataset, model, and optimization dynamics. We therefore leave this to future work.
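As a purely illustrative sketch of the plateau idea mentioned above (the function name, window size, and tolerance are our own hypothetical choices, not part of the paper), such a switching heuristic could look like:

```python
def should_switch_to_pg(fm_losses, window=50, rel_tol=1e-3):
    """Hypothetical heuristic: switch from FM pre-training to PG fine-tuning
    once the moving average of the FM loss stops improving by more than
    rel_tol (relative to its previous level)."""
    if len(fm_losses) < 2 * window:
        return False  # not enough history to judge a plateau
    prev = sum(fm_losses[-2 * window:-window]) / window
    curr = sum(fm_losses[-window:]) / window
    return (prev - curr) < rel_tol * abs(prev)

# Example: a still-decreasing loss keeps FM; a flat tail triggers the switch.
improving = [1.0 - 0.005 * i for i in range(100)]
plateaued = improving + [0.5] * 100
print(should_switch_to_pg(improving))   # False
print(should_switch_to_pg(plateaued))   # True
```

Its cost is negligible since the FM loss is computed during training anyway; the window and tolerance would still have to be tuned per dataset, which is exactly the dependence on optimization dynamics noted above.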
"The performance gains are very impressive on the tested systems. How do you expect this hybrid approach to scale to much larger biomolecules? In such systems, force calculations are more expensive and the energy landscapes are far more complex. Do you hypothesize that the relative benefit of a few epochs of PG fine-tuning will increase or decrease in that regime?"
We thank the reviewer for this insightful question. We believe our hybrid approach has two key advantages that support its scalability to larger systems:
- Path Gradients typically scale well with dimensionality. They have successfully been applied to high-dimensional systems to simulate Quantum Field Theories (>10^3 dimensions [1,2] and even >10^5 dimensions [3,4]). Further, our additional experiments on the unseen dipeptides show a positive correlation of 0.53 between the NLL gain per dimension and the number of particles. This is a good indicator that fine-tuning with PG scales well to higher dimensions for molecules as well.
- A practical strength of our method is that force evaluations are required only once. In many real-world applications, these forces are already available from prior simulations, like in MD or MCMC sampling where forces are required repeatedly. This “recycling” makes our method especially attractive when force evaluations are expensive, as is the case for large biomolecular systems.
However, the effectiveness of Path Gradients remains limited by the expressiveness of the model. If the energy landscape is too complex to be approximated well, we do not expect fine-tuning with PG to resolve this fundamental mismatch. We thus expect the benefits of Path Gradients to extend to larger molecular systems - as long as the model’s capacity is sufficient to capture the complex energy landscape of the target distribution.
[1] Vaitl, Lorenz, et al. "Path-gradient estimators for continuous normalizing flows." International conference on machine learning. ICML, 2022.
[2] Bacchio, Simone, et al. "Learning trivializing gradient flows for lattice gauge theories." Physical Review D 107.5 (2023): L051504
[3] Abbott, Ryan, et al. "Normalizing flows for lattice gauge theory in arbitrary space-time dimension." arXiv preprint arXiv:2305.02402 (2023).
[4] Vaitl, L. "Path Gradient Estimators for Normalizing Flows." PhD thesis.
Thank you to the authors for their detailed rebuttal. Their response has effectively addressed my concerns, and I am satisfied with their commitment to incorporate the corresponding revisions and clarifications in the final version. Therefore, I will maintain my original score and continue to recommend this paper for acceptance.
Thank you for reviewing and the encouraging comments
This work proposes a method to fine-tune a pretrained CNF model sampling a Boltzmann distribution via path-gradients (PG). Because computing PG only requires the data x_1 and unnormalized energy gradients \nabla_{x_1} E(x_1) (which are supposedly already computed to generate training data), PG requires no additional compute in terms of energy evaluations. By fine-tuning with PG, the authors show a significant increase in the effective sample size (ESS) on a number of well-studied energy functions when compared to flow matching or path-gradients alone.
Strengths and Weaknesses
Strengths:
- The method is easy to understand and performing path-gradients yields a large boost in effective sample size (ESS).
- Unlike importance sampling or relaxation, path-gradients require no additional energy evaluations and preserve the target Boltzmann distribution, given that the model is a CNF trained on unbiased data.
Weaknesses:
- It seems that the main benefit of path-gradient fine-tuning is that it improves the efficiency of sampling high quality samples at inference time of a flow model that is already quite good. This is not really a bottleneck in scaling Boltzmann generators since it is quite cheap to sample the diffusion model once it's trained. For Lennard-Jones and alanine dipeptide, the energy is pretty cheap to evaluate and so importance sampling is straightforward. It's hard for me to see how this technique would scale up to harder problems we are unable to approach with flow/diffusion samplers (such as Chignolin in [1]), if it requires the flow model to already be quite good and unbiased.
- Requiring unbiased data seems like a major downside to this method, as this assumes you should already have a way to generate true samples via MCMC or otherwise. I think a more relevant situation would be when you have some biased data (due to limitations of another approach) and want to improve sampling towards the desired Boltzmann distribution, even at the cost of additional evaluation (see [1]).
[1] Tan et al 2025 "Scalable Equilibrium Sampling with Sequential Boltzmann Generators"
Questions
- The path gradient is based on the backwards KL, but the flow-matching marginal vector field is not minimizing the same objective. Yet it is claimed that the structure of the flow isn't changed very much. Do you have intuition about what happens to the marginal of the path in flow matching?
- Similar to my concern above: if this is considered to be a SOTA method for Boltzmann generators, how do path-gradients scale on larger systems which current diffusion samplers have trouble with (larger peptides like Chignolin)? Why not compare to other strategies for fine-tuning/reweighting such as "Scalable Equilibrium Sampling with Sequential Boltzmann Generators"? These methods do not even require unbiased data, but at the cost of additional energy evaluations.
Limitations
Yes
Justification of Final Rating
My concerns were addressed and the authors provided some interesting additional experiments on transferable energies. I think it's a good paper.
Formatting Issues
None
We thank the reviewer for their positive feedback. We're glad the clarity and practical value of our method came through.
Weaknesses
"It seems that the main benefit of path-gradient fine-tuning is that it improves the efficiency of sampling high quality samples at inference time of a flow model that is already quite good. This is not really a bottleneck in scaling Boltzmann generators since it is quite cheap to sample the diffusion model once its trained. For Lennard Jones and Alanine dipeptide, the energy is pretty cheap to evaluate and so importance sampling is straight-forward. It’s hard for me to see how this technique would scale up to harder problems we are unable to approach with flow/diffusion samplers (such as Chignolin in [1]), if it requires the flow model to already be quite good and unbiased."
We agree that path-gradient fine-tuning improves the efficiency of the Boltzmann Generators. However, we disagree that this does not solve the major bottleneck of sampling speed.
Inference with Boltzmann Generators based on CNFs is expensive if one aims to do importance sampling. The major bottleneck is usually not the energy evaluation, which can be cheap, but the integration of the trace of the Jacobian (eq. (6)). Hence, having access to a better Boltzmann Generator, such as one fine-tuned with path-gradients, allows faster sampling, as fewer model evaluations are required per effective sample.
Even when avoiding correction via importance sampling, the samples are significantly closer to the reference (see Figure 4, right) and the target is approximated better. So, while our training approach is not a panacea for samplers, in settings where flow models are already useful, we showed that we can obtain significant gains in performance and soften the bottleneck for importance sampling.
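To make the importance-sampling diagnostics discussed above concrete, here is a minimal, self-contained sketch (not the paper's code) that computes self-normalized importance weights from model log-densities and negative target energies, and reports the Kish effective sample size as a fraction of the sample count. The Gaussian proposal/target pair and all names are purely illustrative.

```python
import numpy as np

def effective_sample_size(log_q, neg_energy):
    # Self-normalized importance weights w_i ∝ p(x_i)/q(x_i), in log space
    # for numerical stability; normalization constants cancel.
    log_w = neg_energy - log_q
    log_w -= log_w.max()  # guard against overflow in exp
    w = np.exp(log_w)
    # Kish ESS as a fraction of N: (Σw)^2 / (N Σw^2), always in (0, 1].
    return w.sum() ** 2 / (len(w) * (w ** 2).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
log_q = -0.5 * x**2                # standard-normal proposal (up to a constant)
neg_energy = -0.5 * (x - 0.1)**2   # slightly shifted Gaussian target
ess = effective_sample_size(log_q, neg_energy)
```

For a well-matched proposal the ESS fraction is close to 1; as the model drifts away from the target, the weights degenerate and the ESS fraction collapses, which is exactly the bottleneck a better fine-tuned generator softens.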
"Requiring unbiased data seems like a major downside to this method, as this assumes you should already have a way to generate true samples via MCMC or otherwise. I think a more relevant situation would be when you have some biased data (due to limitations of another approach) and want to improve sampling towards the desired boltzmann, even at the cost of additional evaluation (see [1])."
We appreciate the reviewer's point; relying on unbiased samples is indeed a limitation. However, we would like to clarify a few key aspects to put this downside into perspective:
- In a setting where you have access to the derivative of the energy, you can always apply MCMC or MD. The settings with Boltzmann Generators, as well as the approach by Tan et al. (2025) [1] or protein folding, fall in this regime. The issue with MCMC and MD is that they are not easily parallelizable and exhibit high autocorrelation. Deep generative models simply aim to speed up the process. Sample-efficient training methods, like Path Gradients, are of great value in this setting.
- In practice, a small bias in the training data is tolerable. For example, in our AD2 and TBG experiments, the training data likely contained mild sampling bias due to finite-length MD runs, yet the models performed well after PG fine-tuning. In additional experiments on dipeptides that we did for this rebuttal, we also saw an improved performance even though the “dipeptides, which were simulated each with a classical force field for 50 ns and, therefore, may not have reached convergence” (Klein & Noe 2024).
- In these additional experiments we further investigated the transferability to unseen dipeptides and saw that PG was able to generalize to unseen systems, further relaxing this downside. In the transferable setting, the model needs only equilibrium samples from the training systems; no data is required for the unseen test systems to which it transfers. Below we summarize the additional experiments.
[1] Tan et al 2025 "Scalable Equilibrium Sampling with Sequential Boltzmann Generators"
Questions:
"The path gradient is based on the backwards KL, but the flow-matching marginal vector-field is not minimizing the same objective. Yet it is claimed that the structure of the flow isnt changed very much. Do you have intuition about what happens to the marginal of the path in flow matching?"
Indeed, Flow Matching (FM) and Path Gradients (PG) optimize different objectives:
FM minimizes a trajectory-level MSE to match a reference trajectory, while PG minimizes the KL divergence between the induced model distribution at time $t=1$ and the target. Since different trajectories can lead to the same transformation (e.g., via time reparametrization), the KL is agnostic to the actual trajectory for $t<1$, i.e., it depends only on the endpoint of the transformation at $t=1$.
So for a minimal FM loss the KL is also minimized, while the reverse does not hold. Our experiments suggest that when PG is used after FM pretraining, the variational distribution $q_\theta$ is already close to the target distribution $p$, and PG fine-tuning adjusts the trajectory only slightly, without deviating too much from the reference trajectory.
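The "forces are reused, no new energy evaluations" property of PG training can be illustrated in one dimension. The sketch below is a simplified illustration under assumptions not in the paper: an affine flow x = theta * z with a standard-normal base, a Gaussian target, and the analytically known score standing in for the precomputed MD forces. All names are illustrative; it only demonstrates that the path-gradient estimator matches the analytic KL gradient and vanishes pointwise at the optimum.

```python
import numpy as np

def path_gradient(x1, theta, force):
    """Path-gradient estimate of d KL(p || q_theta) / d theta for the
    affine flow x = theta * z, z ~ N(0, 1), so q_theta = N(0, theta^2)."""
    score_q = -x1 / theta**2        # grad_x log q_theta(x1)
    dx_dtheta = x1 / theta          # d f_theta(z)/d theta at z = x1/theta
    # (model score - target score) contracted with the parameter sensitivity
    return np.mean((score_q - force) * dx_dtheta)

sigma = 1.5
rng = np.random.default_rng(0)
x1 = rng.normal(0.0, sigma, size=200_000)  # "target samples"
force = -x1 / sigma**2                     # grad_x log p, reused, never re-evaluated

theta = 2.0
g_path = path_gradient(x1, theta, force)
# analytic gradient of KL(p || q_theta): (1 - sigma^2/theta^2)/theta
g_true = (1 - sigma**2 / theta**2) / theta
```

Note that at theta = sigma the integrand vanishes pointwise, not just in expectation, reflecting the low-variance behavior of path gradients near the optimum.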
"Similar to my concern above: If this is considered to be a SOTA method for Boltzmann generators, how do path-gradient scale on larger systems which current diffusion samplers have trouble with (larger peptides like Chignolin)? why not compare to other strategies for fine-tuning / reweighting such as “Scalable Equilibrium Sampling with Sequential Boltzmann Generators”. These methods do not even require unbiased data, but at the cost of additional energy evaluations."
We agree with the reviewer that scalability is an open question. However, scaling to larger systems often comes with high computational costs. In this work, we aim to show how to improve current Boltzmann Generators with path-gradient fine-tuning and focus on the architecture trained with flow matching. That said, path-gradient fine-tuning is also applicable to a large class of Normalizing Flow architectures (see [7]).
PGs have been demonstrated to perform well in high dimensional settings e.g. in O(10^3) dimensions in [2,3] or even in O(10^5) in [4,5,6] in quantum field theory. Still, Chapter 6 of [6] suggests that a limiting factor is often the complexity of the target density compared to the modeling power of the model.
As the reviewer mentions, the method in [1] requires additional energy evaluations compared to traditional BGs, which can be a bottleneck depending on the energy function. Importantly, for the path-gradient training, no additional energy evaluations are required during training, as the forces of the training sample are used.
Further, note that PG-training can readily be combined with the non-equilibrium sampling scheme from Proposition 2 in [1], since the former is a training scheme and the latter is a test-time sampling scheme.
Additional Dipeptide Experiments:
We applied our hybrid approach to transferable Boltzmann Generators on dipeptides as introduced in Klein et al. 2024. We again fine-tune the pretrained Boltzmann Generator from Klein et al. 2024 with path gradients. Training took 6 days on an A100. This transferable Boltzmann Generator is trained on a subset of all possible dipeptides and evaluated on unseen ones. Our experiments show that fine-tuning with path gradients improves the NLL and energies for all evaluated unseen test dipeptides and also shows average improvements for the ESS. We will include these findings in the final version. For more detailed results, see table below:
| Dipeptide | NLL (Before) | NLL (After PG) | ESS (Before) | ESS (After PG) |
|---|---|---|---|---|
| AC | -63.90 | -63.91 | 31.8 | 33.3 |
| AT | -75.49 | -75.61 | 20.8 | 21.5 |
| ET | -89.94 | -90.08 | 6.3 | 4.1 |
| GN | -65.45 | -65.55 | 19.2 | 25.6 |
| GP | -71.63 | -71.77 | 10.5 | 8.0 |
| HT | -103.06 | -103.27 | 0.2 | 5.5 |
| IM | -106.80 | -107.01 | 3.1 | 4.6 |
| KG | -88.60 | -88.79 | 5.0 | 7.4 |
| KQ | -119.36 | -119.68 | 4.3 | 2.0 |
| KS | -100.82 | -100.99 | 6.0 | 4.6 |
| LW | -148.91 | -149.22 | 0.4 | 3.6 |
| NF | -116.45 | -116.67 | 3.7 | 12.2 |
| NY | -118.24 | -118.47 | 9.5 | 10.5 |
| RL | -135.71 | -136.12 | 1.3 | 1.5 |
| RV | -126.28 | -126.62 | 1.3 | 0.6 |
| TD | -81.09 | -81.19 | 4.1 | 11.6 |
| Average | -100.73 | -100.93 | 7.96 | 9.79 |
[2] Vaitl, Lorenz, et al. "Path-gradient estimators for continuous normalizing flows." International conference on machine learning. ICML, 2022.
[3] Bacchio, Simone, et al. "Learning trivializing gradient flows for lattice gauge theories." Physical Review D 107.5 (2023): L051504
[4] Abbott, Ryan, et al. "Applications of flow models to the generation of correlated lattice QCD ensembles." Physical Review D 109.9 (2024): 094514.
[5] Abbott, Ryan, et al. "Normalizing flows for lattice gauge theory in arbitrary space-time dimension." arXiv preprint arXiv:2305.02402 (2023).
[6] Vaitl, L. "Path Gradient Estimators for Normalizing Flows." PhD thesis.
[7] Vaitl, L., et al. "Fast and Unified Path Gradient Estimators for Normalizing Flows." ICLR 2024.
I want to thank the authors for the clarification and additional experiments. They're quite interesting.
While I agree that increased sampling efficiency will be important for a downstream importance sampling task, it seems that it isn't really demonstrated in this setup (which is just applying path-gradients after flow matching). I can imagine using this path-gradient approach to continually improve a generative model as part of the training loop itself. It might be useful to include some more discussion about this downstream utility in the final paper.
Also, I do see that it improves the quality of samples as shown by the histogram in Fig. 4. I find these types of visualization very informative and it would be great to have them for all of the systems you considered in Table 1 in the final paper (I wouldn't expect to see those at this stage, of course). I would be curious about what it looks like for LJ55, as that energy profile is usually pretty tough to match exactly.
Given that these things are addressed, I plan to raise my score accordingly.
We want to thank the reviewer again for their review and are happy that they find the additional experiments and clarification helpful. We will include the experiments as well as all energy histograms in the final version.
The authors present a method to augment Boltzmann Generators using the output of MD simulations, requiring only an initial energy and gradients of the energy along a sampled path. They demonstrate good results on model systems like alanine dipeptide. While the reviewers raised concerns about scalability and applicability in situations where a good energy function was not available, the authors managed to win many of the reviewers over in the rebuttal period. There was a consensus that the paper should be accepted, which I endorse.