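Average rating: 6.8/10 (Spotlight, 5 reviewers)

Lowest: 6 · Highest: 7 · Standard deviation: 0.4

Confidence: 2.8 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 3.4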

NeurIPS 2024

Doob's Lagrangian: A Sample-Efficient Variational Approach to Transition Path Sampling

Yuanqi Du,Michael Plainer,Rob Brekelmans,Chenru Duan,Frank Noe,Carla P Gomes,Alan Aspuru-Guzik,Kirill Neklyudov

Submitted: 2024-05-15 · Updated: 2024-12-27

TL;DR

Lagrangian formulation of Doob's h-transform allowing for an efficient rare event sampling

Abstract

Rare event sampling in dynamical systems is a fundamental problem arising in the natural sciences, which poses significant computational challenges due to an exponentially large space of trajectories. For settings where the dynamical system of interest follows a Brownian motion with known drift, the question of conditioning the process to reach a given endpoint or desired rare event is definitively answered by Doob's $h$-transform. However, the naive estimation of this transform is infeasible, as it requires simulating sufficiently many forward trajectories to estimate rare event probabilities. In this work, we propose a variational formulation of Doob's $h$-transform as an optimization problem over trajectories between a given initial point and the desired ending point. To solve this optimization, we propose a simulation-free training objective with a model parameterization that imposes the desired boundary conditions by design. Our approach significantly reduces the search space over trajectories and avoids expensive trajectory simulation and inefficient importance sampling estimators which are required in existing methods. We demonstrate the ability of our method to find feasible transition paths on real-world molecular simulation and protein folding tasks.

Keywords

Transition Path Sampling, Protein Folding, Schrödinger Bridge

Reviews and Discussion

Review

Rating: 7 · Confidence: 3 · 2024-06-20

The paper is concerned with sampling trajectories with a terminal condition. For stochastic processes driven by Brownian motion, Doob's h-transform gives a posterior SDE whose samples satisfy the terminal condition. However, estimating the h-function needed for the posterior SDE usually involves simulating trajectories, which is inefficient if the terminal condition is rarely reached. The authors propose a simulation-free variational optimization method to estimate the h-function based on a least action principle and Gaussian approximations to the marginal densities.

Strengths

The authors provide a solid and clear background on Doob's h-transform that is supported by the proofs provided in the appendix.
The paper clearly highlights the challenges of the optimization problem and proposes an efficient solution addressing these challenges.
The related work section gives a good overview and nicely connects to related topics.

Weaknesses

I found the path histograms in Figure 2 to be too cluttered. I would propose adding fewer samples.
The paper misses a learning curve. It would in general be interesting to have more training details.

Minor Weaknesses:

In Chapter 3 and in the appendix, the authors change from trajectory length T to the unit interval. There should be a sentence that explains this change.

Questions

I could not totally follow the derivation in the appendix. Can you explain how to get to Equation 25 from the previous equation in line 548? I do not see why the second term in equation 25 gets subtracted, while in line 548 only the last term has a minus.

Limitations

The authors adequately addressed the limitations of their work.

Author Response

2024-08-07

Q1. I found the path histograms in Figure 2 to be too cluttered. I would propose adding fewer samples.

Thank you for your feedback and the concrete suggestion. Our goal was to illustrate the diversity of the ensemble of transition paths. We will revise Figure 2 by reducing the number of paths and will highlight example trajectories in different colors. This will help visualize the converged behavior while allowing for the investigation of individual transitions.

Q2. The paper misses a learning curve. It would in general be interesting to have more training details.

We agree with you and have uploaded a PDF containing training curves to provide more insight. While the loss itself may not be very revealing, we showcase the quality of paths (i.e., max energy) at a certain compute budget (i.e., number of potential evaluations).

Additionally, we will include two algorithms in the revised manuscript to clarify details on the training and inference. These algorithms are also included in the rebuttal PDF.

Q3. In Chapter 3 and in the appendix, the authors change from trajectory length T to the unit interval. There should be a sentence that explains this change.

Thank you for pointing out this inconsistency. We have corrected this error by unifying the notation throughout the paper to consistently consider trajectories of length $T$.

Q4. I could not totally follow the derivation in the appendix. Can you explain how to get to Equation 25 from the previous equation in line 548? I do not see why second term in equation 25 gets subtracted, while in line 548 only the last term has a minus.

After the substitution of (24) into the equation in line 548, the third term becomes $\int dtdx\ s_t \langle\nabla,q_t(b_t + 2G_t\nabla s_t)\rangle = \int dtdx\ s_t \langle\nabla,q_t b_t\rangle - 2\int dtdx\ q_t\langle\nabla s_t,G_t\nabla s_t\rangle,$ where the last equality is due to the integration by parts. The last term in this equation, together with the first term in line 548, yields the second term in equation (25).
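Written out in two steps, the integration by parts in this answer reads as follows. This is only a restatement of the equation above, under the assumption that boundary terms vanish (for example, because $q_t$ decays at infinity):

```latex
\begin{align*}
\int dt\,dx\; s_t \langle \nabla, q_t (b_t + 2 G_t \nabla s_t) \rangle
&= \int dt\,dx\; s_t \langle \nabla, q_t b_t \rangle
 + 2 \int dt\,dx\; s_t \langle \nabla, q_t G_t \nabla s_t \rangle \\
&= \int dt\,dx\; s_t \langle \nabla, q_t b_t \rangle
 - 2 \int dt\,dx\; q_t \langle \nabla s_t, G_t \nabla s_t \rangle .
\end{align*}
```

The first line uses linearity of the divergence; the second moves the divergence from $q_t G_t \nabla s_t$ onto $s_t$, which produces the minus sign the reviewer asked about.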

2024-08-13

Thank you for your response. I will increase my original score by one.

Review

Rating: 7 · Confidence: 2 · 2024-07-02

This paper proposes a variational formulation of Doob's h-transform, which characterizes the distribution over paths with a given endpoint. Instead of relying on potentially wasteful sampling approaches, the authors propose directly optimizing a tractable variational distribution over transition paths which satisfy the initial and terminal conditions by design. This approach reduces the search space over trajectories and avoids trajectory simulation. Experiments on real-world molecular simulation and protein folding tasks demonstrate the applicability of the approach.

Strengths

The paper clearly motivates the problem it tackles, describes the challenges well and nicely introduces the idea behind their method in an illustrative fashion. Overall the paper is very well-written and structured. In particular, it builds up the method piece-by-piece explaining the choices along the way. As far as I can judge the related work seems to be exhaustive. The experiments are done on interesting problems as far as I can tell and I also appreciated that the authors made their code publicly available.

Weaknesses

My main concern comes from trying to interpret the experimental results and in particular judging the performance compared to MCMC. It seems quite clear from Tables 1 and 2 that the variational approach requires fewer calls to the potential energy function than MCMC; however, their performance differences are harder to judge in my opinion. More specifically, in Table 1 the standard deviations are so large that there is basically no meaningful statistical difference between the shown results especially for the Max Energy, but also the Log-likelihood. Now this might mean that MCMC and the presented method both perform well, but it's surprising to me that there is that much inherent variation. Similarly in Table 2, the variance for the Max Energy of MCMC in the first line is huge. I'm also a bit confused by the increase in Max Energy when using a mixture in Table 2. To better understand how the performance of the variational approach improves during training, it would be nice to see a plot that shows the Max Energy as a function of the training epochs.

Finally, I would have liked to see a discussion of the limitations of the approach. Section 6 has Limitations in the title, but does not actually discuss them in any way beyond extensions of the proposed method.

Questions

How do you choose the number of mixture components in practice?
Why does the MaxEnergy increase when using a mixture distribution as a variational approximation in Table 2?
How do you explain the huge variance of the Max Energy of MCMC (variable length) in the first row of Table 2?
What are the main limitations of the current approach?

Limitations

Section 6 has "Limitations" in the title, but limitations are not discussed, only future work.

Author Response

2024-08-07

Q1. To better understand how the performance of the variational approach improves during training, it would be nice to see a plot that shows the Max Energy as a function of the training epochs.

Please see the supplementary PDF for the plot of max energy as a function of energy evaluations. We will include this plot in the final version of the paper.

Q2. More specifically, in Table 1 the standard deviations are so large that there is basically no meaningful statistical difference between the shown results especially for the Max Energy, but also the Log-likelihood.

The variance of the max energy (and log-likelihood) is measured over independently sampled paths from the target path-measure (the Brownian motion conditioned on the endpoint). Note that the large variance is not due to inaccurate measurements or poor performance of the model but due to the diffusion coefficient of the reference measure. Thus, we expect our model to match these numbers requiring fewer energy evaluations rather than surpass the MCMC results in terms of energy or likelihood. Therefore, the performance measure in Table 1 is the number of evaluations.

Q3. How do you explain the huge variance of the Max Energy of MCMC (variable length) in the first row of Table 2?

In Table 2, the variance of the max energy is measured across paths of different lengths, which gives very different estimates of the maximum energy. Indeed, when the path’s length is not optimal (too short or too long paths), the maximum energy is very high (the trajectory either has to take shortcuts or can wander into high energy regions). Note that these paths do not affect the estimate of the minimum energy but contribute to the estimate of its mean and variance, which results in the enormous value of the latter.
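To illustrate the point about outlier paths, here is a minimal sketch of how per-path max-energy statistics could be aggregated. The energy values and the helper name `max_energy_stats` are hypothetical and are not taken from the paper's code or tables.

```python
import statistics

def max_energy_stats(path_energies):
    """path_energies: list of paths, each a list of potential energies along the path."""
    # The max energy of a path is the energy of its highest point
    # (used as a proxy for the transition state).
    per_path_max = [max(energies) for energies in path_energies]
    return {
        "min_max": min(per_path_max),             # best (lowest) barrier found
        "mean": statistics.mean(per_path_max),    # average over all paths
        "std": statistics.pstdev(per_path_max),   # spread over all paths
    }

# Made-up energies: two near-optimal paths and one path of poor length.
paths = [[0.0, 5.0, 1.0], [0.0, 6.0, 1.0], [0.0, 400.0, 1.0]]
stats = max_energy_stats(paths)
```

In this toy data the single high-energy path leaves the minimum untouched but dominates the mean and standard deviation, which is the effect described in the answer above.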

Q4. Why does the MaxEnergy increase when using a mixture distribution as a variational approximation in Table 2?

For a single component, our algorithm samples from a low-energy transition path (mode-seeking behavior). When we add more components into the mixture, it starts including less likely, higher-energy paths, covering other modes and increasing the variance and mean. Note that min-max energy does not change with the introduction of several components.

Q5. Finally, I would have liked to see a discussion of the limitations of the approach. Section 6 has Limitations in the title, but does not actually discuss them in any way beyond extensions of the proposed method.

Thank you for the suggestion. Indeed, the current discussion focuses mostly on future work rather than limitations. We will address this in the revised manuscript by discussing the following limitations: (1) the computational inefficiency of learning a mixture of Gaussian paths; (2) as already pointed out in our future work, the rigidity of defining states A and B as a point mass with a Gaussian instead of an arbitrary set; and (3) as also mentioned in the future work section, the restriction to transition paths of fixed length rather than variable length.

Q6. How do you choose the number of mixture components in practice?

The number of mixture components is a hyperparameter that should be chosen based on the complexity of the system and the available computational budget. Increasing the number of Gaussian mixtures enhances expressivity and improves the model's ability to capture diverse transition paths.

When using learnable weights $w$, less-dominant reaction channels receive smaller weights during training. In our toy experiment in Figure 3, we observed that increasing the number of mixture components beyond the number of reactive channels resulted in only the first two components having significant weights (around 0.5), while the remaining components had weights close to zero. This suggests that the weights can be used as a proxy to determine the optimal number of components, similar to how principal component analysis identifies significant components, so that less-likely channels can be discarded.
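The weight-based heuristic described here could be sketched as below. The threshold of 0.05, the function name, and the example weights are assumptions made for illustration; the authors do not specify a cutoff.

```python
def significant_components(weights, threshold=0.05):
    """Return indices of mixture components whose normalized weight exceeds `threshold`."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    return [k for k, w in enumerate(normalized) if w > threshold]

# Hypothetical learned weights for a 4-component mixture on a two-channel system.
learned_w = [0.49, 0.50, 0.006, 0.004]
kept = significant_components(learned_w)
```

With these made-up weights only the first two components survive, which mirrors the Figure 3 observation quoted above. The appropriate threshold would depend on the system.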

2024-08-08

Thank you for your response. I feel my original score is still appropriate.

Review

Rating: 7 · Confidence: 3 · 2024-07-08

The Authors of this paper tackle the problem of sampling conditioned SDEs with a specific interest in "transition path sampling", i.e. sampling a Langevin-type SDE undergoing a transition between an initial state (or set of states) $A$ and a final target set $B$. Sampling transition paths efficiently can provide a tremendous boost to research in catalysis or drug design. In this paper, the Authors model the transition path distribution with either (i) a parametrized Gaussian process or (ii) a parametrized mixture of Gaussian processes. These families of priors are then optimized by leveraging a variational formulation of the problem developed by the Authors starting from Doob's $h$-transform.

Strengths

The Authors tackle a challenging problem with a novel variational formulation, which is apt to be optimized by leveraging techniques developed by the generative modeling community within ML. A clear strength of the paper is the sound theoretical analysis justifying the proposed method. Interestingly, the choice of Gaussian process (or a mixture of Gaussian processes) variational priors allows the Authors to simplify the algorithm using analytical results and sidestep lengthy calculations at the deployment phase.

Weaknesses

I find the paper overall clearly written, but I had to go through section 3.2 (Computational Approach) several times to grasp how the method can be deployed in practice. Specifically, I think the paper could better explain that, because of the Gaussian prior, you only need to model the transition path probability and not the $h$-transform itself, and could better explain the actual optimization step (Reparametrization of Gradients).

The subscript $0,T$ is introduced for the first time in Eq. (6) without explanation and used throughout the manuscript to label variables related to the conditioned process. It might be useful to clarify this notation explicitly.

The experimental evaluation is somewhat limited, especially concerning the Chignolin protein. Over the years, many transition path sampling strategies have been developed, as well as very much related enhanced sampling techniques. It would be nice to have a comparison also to different baselines, as well as a discussion on the computational complexity and actual running times of the baselines.

Questions

Is the $h$-transform related to the committor function in some way? If so, how does your Thm 1 relate to the variational formulation of the committor (see e.g. Eq. 20 of "Transition Path Theory and Path-Finding Algorithms for the Study of Rare Events" by Weinan E and Eric Vanden-Eijnden)?

Limitations

Limited experimental evaluation

Author Response

2024-08-07

Q1. I find the paper overall clearly written, but I had to go through section 3.2 (Computational Approach) several times to grasp how the method can be deployed in practice. Specifically, I think the paper could better explain that, because of the Gaussian prior, you only need to model the transition path probability and not the h-transform itself, and could better explain the actual optimization step (Reparametrization of Gradients).

Thank you very much for the concrete suggestion. We will improve the presentation to emphasize that we only optimize for $q$ due to the Gaussian parameterization. In the revised manuscript, we also introduce two algorithms (which can be found in the PDF uploaded during the rebuttal) detailing the training and inference procedure. These additions should make our approach more understandable. We welcome any further suggestions you might have on making the paper more accessible.

Q2. The subscript 0, T is introduced for the first time in Eq. (6) without explanation and used throughout the manuscript to label variables related to the conditioned process. It might be useful to clarify this notation explicitly.

Thank you for pointing this out. We have updated the manuscript to explicitly introduce and clarify the subscript notation. We hope this makes the notation clearer and easier to follow throughout the manuscript.

Q3. The experimental evaluation is somewhat limited, especially concerning the Chignolin protein. Over the years, many transition path sampling strategies have been developed, as well as very much related enhanced sampling techniques. It would be nice to have a comparison also to different baselines, as well as a discussion on the computational complexity and actual running times of the baselines.

Thank you for raising your concerns. There is minimal difference among MCMC methods regarding the quality of paths, as they all guarantee convergence to the true distribution. We are unaware of any non-MCMC-based transition path sampling procedures that significantly outperform existing techniques in terms of runtime and trajectory quality.

Thus, we view MCMC results not as baselines to surpass but as a gold standard to approximate. Comparing with different variations of MCMC (e.g., improved shooting-point selection [1,2], machine-learned MCMC [3]) might reduce the number of evaluations needed but would not affect the theoretical guarantees. Since the effectiveness of these techniques varies greatly across different systems, we did not include such comparisons. However, we will discuss the computational complexity and running times in the revised manuscript to provide a clearer picture of the performance trade-offs involved.

As for Chignolin, we showed that we can find plausible paths with a reasonable number of energy evaluations. However, because Chignolin is already high-dimensional, sampling a meaningful ensemble of transition paths with existing methods is challenging. This highlights the advantage of our approach.

Q4. Is the h-transform related to the committor function in some way? If so, how does your Thm 1 relate to the variational formulation of the committor (see e.g. Eq. 20 of "Transition Path Theory and Path-Finding Algorithms for the Study of Rare Events" by Weinan E and Eric Vanden-Eijnden)?

The committor function defined in eq. (18) of [4] is different from the h-transform. Indeed, the committor function $q_{+}$ is time-independent and satisfies eq. (18) in [4], while the PDE for the h-transform (eq. (8b) of our paper) includes the time derivative. The committor function can be obtained by integrating $h$ over different event times $T$; however, this is beyond the scope of the current paper.

References

[1] P. G. Bolhuis, D. Chandler, C. Dellago, and P. L. Geissler, 2002, “TRANSITION PATH SAMPLING: Throwing Ropes Over Rough Mountain Passes, in the Dark” Annual Review of Physical Chemistry, vol. 53, no. 1. Annual Reviews, pp. 291–318.

[2] J. Juraszek and P. G. Bolhuis, 2008, “Rate Constant and Reaction Coordinate of Trp-Cage Folding in Explicit Water,” Biophysical Journal, vol. 95, no. 9. Elsevier BV, pp. 4246–4257.

[3] H. Jung et al., 2023, “Machine-guided path sampling to discover mechanisms of molecular self-organization,” Nature Computational Science, vol. 3, no. 4. Springer Science and Business Media LLC, pp. 334–345.

[4] E. Vanden-Eijnden, 2010, “Transition-Path Theory and Path-Finding Algorithms for the Study of Rare Events,” Annual Review of Physical Chemistry, vol. 61, pp. 391–420.

2024-08-08

I thank the Authors for their thorough rebuttal and clarifications. I will keep my evaluation to "Accept".

Review

Rating: 6 · Confidence: 3 · 2024-07-10

The submitted manuscript presents a variational formulation of Doob’s h-transform, leading to a novel (simulation-free) computational approach for rare event sampling in transition paths. The task of interest involves conditioning a dynamical system driven by Brownian motion with a known drift term to reach a given endpoint. In theory, this terminal condition can be addressed using Doob’s h-transform, resulting in an associated SDE for the conditional dynamics. However, practical implementation requires knowledge of the h-transform. The authors propose a variational problem formulation that provides the necessary information about the h-transform as its solution. Solving this variational problem results in a computational approach for simulating the conditional dynamics. Compared to existing methods, it is claimed that the proposed approach avoids importance sampling estimators and expensive trajectory simulations. The method is tested on both synthetic and real datasets.

Strengths

The paper introduces a novel variational objective to describe Doob’s h-transform, leading to a promising and innovative computational approach for simulating the conditional dynamics of transition paths. Although I appreciate the general approach presented in Section 3.1, I am not convinced by the computational approach proposed in Section 3.2 due to several reasons pointed out below. If these issues can be addressed, I believe that the approach can be further utilized to construct highly efficient sampling strategies.

Weaknesses

The proposed approach in Section 3.2.1 introduces certain issues. The connection to Doob’s h-transform established in Theorem 1 is only guaranteed if an exact solution to the optimization problem (9) or (11) is found. However, by introducing the Gaussian parametrization of $q_{t|0,T}$, this guarantee is lost. There is a lack of discussion regarding the implications and potential effects of this parametrization on the overall accuracy and validity of the method.
Moreover, the formulation of the task of solving equation (12) given $q_{t|0,T}$ is misleading. In reality, you are solving an inverse problem here. Given the solution of the PDE in (12), your goal is to reconstruct $u_{t|0,T}$, which is generally an ill-posed problem. For instance, introducing uncertainties into the model description via the drift $b_t$ or diffusion $\Xi_t$ can lead to significant challenges. This scenario is likely when applying the approach in practice under model misspecification. A critical question to address is the robustness of the proposed framework against model misspecification, such as small perturbations in $b_t$.
The claim that the approach is simulation-free is not entirely clear. In line 173, it is stated that the Gaussian parametrization allows for the generation of arbitrary samples. However, this assumes that the states $x_{t|0,T}$ are independent for all $t$, which is not true when $x_{t|0,T}$ is a solution of the SDE in (10). The dependence between the states needs to be accounted for in the sampling process, and this aspect seems to be overlooked.
I am wondering what the measured quantities in the numerical experiments actually reveal about the correctness of the proposed approach. Evaluating the maximum energy and the likelihood might not provide sufficient information about the accuracy of the estimated distribution. If the sampled paths only follow high likelihood regions, this does not necessarily indicate that the correct distribution has been captured. The objective should be to estimate quantities of interest related to the conditioned SDE in (6). However, it is not clear whether the simulated paths correspond to accurate simulations of (6). For instance, how does the method perform when estimating rare event probabilities or other Monte Carlo estimators with respect to (6)? This aspect needs to be thoroughly addressed to validate the effectiveness of the proposed approach.

Questions

To enhance the practical applicability of the proposed framework, it is crucial to study the impact of inexactly solving the variational problems (9) or (11). Specifically, how do errors propagate to the conditioned dynamical system driven by the SDE (6) when an approximate solution of (9) or (11) is used? Understanding this error propagation is essential for assessing the robustness and reliability of the proposed method in practical scenarios.
Is the optimization problem in (9) well-posed, i.e., is the optimal solution unique?
In several instances, there are missing commas in the notation for the inner product (e.g., equations (9b) and (12)).
Many important mathematical details and assumptions are missing. Most importantly, what are the assumptions on the drift vector field $b_t$ to ensure the well-posedness of the proposed scheme?
Regarding Section 3.2.2: Instead of introducing $\xi_{\min}$, I suppose that one could also directly work with a pseudoinverse when defining $G_t^{-1}$.
In Proposition 4, should it be $u_{t|0,T}^{(k)}$ on the right-hand side of the equation for $u$?
When referring to probability density functions, it's important to use proper notation. Instead of $\rho(x_t=x)$, it would be clearer to use $\rho_t(x)$ depending on the context.
In the abstract, you claim that no "inefficient" importance sampling estimators are required. However, it would be beneficial to compare this approach with such estimators to demonstrate its advantages more clearly.

Limitations

There is no discussion about error propagation arising due to the Gaussian parametrization.
The approach is limited to conditioning the sample path on point sets $x_T = B$ .
Limited statistical evidence is presented in the numerical experiments.

Comment: References

2024-08-07

[1] Blei, Kucukelbir, McAuliffe. “Variational Inference: A Review for Statisticians”, 2016.

[2] Jónsson, H., Mills, G. and Jacobsen, K.W., 1998. Nudged elastic band method for finding minimum energy paths of transitions. In Classical and quantum dynamics in condensed phase simulations (pp. 385-404).

[3] Weinan, E., Ren, W. and Vanden-Eijnden, E., 2004. Minimum action method for the study of rare events. Communications on pure and applied mathematics, 57(5), pp.637-656.

[4] C. Dellago, P. G. Bolhuis, and P. L. Geissler, 2006, “Transition Path Sampling Methods,” Computer Simulations in Condensed Matter Systems: From Materials to Chemical Biology Volume 1. Springer Berlin Heidelberg, pp. 349–391.

[5] Brekelmans, Rob, and Kirill Neklyudov, 2023, "On Schrödinger Bridge Matching and Expectation Maximization." In NeurIPS 2023 Workshop Optimal Transport and Machine Learning.

[6] B. Jamison, 1975, “The Markov processes of Schrödinger,” Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, vol. 32, no. 4. Springer Science and Business Media LLC, pp. 323–331.

[7] Holdijk, L., Du, Y., Hooft, F., Jaini, P., Ensing, B. and Welling, M., 2024. Stochastic optimal control for collective variable free sampling of molecular transition paths. Advances in Neural Information Processing Systems, 36.

Author Response

2024-08-07

Q1: The connection to Doob’s h-transform established in Theorem 1 is lost due to the Gaussian parametrization of q_t. Lack of discussion regarding the implications of this parametrization on the accuracy and validity of the method.

We highlight the potential lack of expressivity for the Gaussian parameterization through experiments in Fig. 3 and lines 283-286, where a single Gaussian parameterization fails to capture several modes of the transition path. In this case, our mixture of Gaussian parameterization (proposed in Sec. 3.2.3) is necessary.

The approach of finding an approximate solution by optimizing within a tractable family is common in variational methods [1]. Indeed, our contributions include deriving a variational objective for Doob’s h-transform in Thm 1. The accuracy of particular variational families is difficult to characterize for general problems, and its analysis is beyond the scope of this work.

Q2: The formulation of the task of solving equation (12) given q_t is misleading. Given the solution of the PDE in (12), your goal is to reconstruct u_t, which is generally an ill-posed problem.

As we point out in line 152, indeed, numerous vector fields $u_{t|0,T}$ can satisfy eq. (12) for the given densities $q_{t|0,T}$. In Proposition 3, our goal is to simultaneously parametrize both $q_{t|0,T}$ and $u_{t|0,T}$ such that they satisfy (9b) (thus avoiding a constrained optimization problem).
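To make the idea of a jointly parametrized pair $(q, u)$ concrete, here is a small numerical check in one dimension. It is a sketch under stated assumptions, not the paper's Proposition 3: it assumes the constraint has the Fokker-Planck form $\partial_t q = -\partial_x(q u) + G\,\partial_x^2 q$ with a constant scalar $G$, and the mean and standard deviation paths `mu` and `sigma` are arbitrary made-up functions. For a Gaussian marginal, one vector field that satisfies this equation is $u_t(x) = \dot\mu_t + (\dot\sigma_t/\sigma_t - G/\sigma_t^2)(x - \mu_t)$.

```python
import math

G = 0.5  # assumed constant scalar diffusion term

def mu(t):      # hypothetical mean path
    return math.sin(t)

def sigma(t):   # hypothetical standard deviation, kept positive
    return 0.5 + 0.2 * t

def ddt(f, t, h=1e-5):
    # Central finite difference for a time derivative.
    return (f(t + h) - f(t - h)) / (2 * h)

def q(t, x):
    # Gaussian marginal density N(x; mu(t), sigma(t)^2).
    s = sigma(t)
    return math.exp(-0.5 * ((x - mu(t)) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def u(t, x):
    # Affine vector field chosen so that (q, u) satisfies the assumed Fokker-Planck equation.
    s = sigma(t)
    return ddt(mu, t) + (ddt(sigma, t) / s - G / s**2) * (x - mu(t))

def fp_residual(t, x, h=1e-3):
    # Residual of dq/dt + d(q u)/dx - G d^2q/dx^2, approximated by finite differences.
    dq_dt = (q(t + h, x) - q(t - h, x)) / (2 * h)
    dflux_dx = (q(t, x + h) * u(t, x + h) - q(t, x - h) * u(t, x - h)) / (2 * h)
    d2q_dx2 = (q(t, x + h) - 2 * q(t, x) + q(t, x - h)) / h**2
    return dq_dt + dflux_dx - G * d2q_dx2

residual = fp_residual(t=0.7, x=0.3)
```

The residual is zero up to finite-difference error, so the constraint holds by construction and no constrained optimization is needed. As the answer above notes, this `u` is only one of many vector fields compatible with the same marginals.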

Q3: For instance, introducing uncertainties into the model description via the drift b_t or diffusion \Xi_t can lead to significant challenges.

We agree that analyzing the robustness of the proposed approach against model misspecification is crucial for practical applications. However, this analysis is beyond the scope of the current work.

Q4: The claim that the approach is simulation-free is not entirely clear.

The sampling of independent $x_{t|0,T}$ for a given time $t$ is justified by parameterizing marginal densities rather than the entire path measure. Indeed, our parametrization in eq. (15) defines the parameters of the marginals that change continuously over time (the SDE that has these marginals can be obtained via Proposition 1). The optimized objective in Theorem 1 relies only on the samples from these marginals, which means that during training, no simulation is needed. When sampling paths, the drift term does not need to be evaluated, allowing for efficient sampling.

We added pseudocode for the proposed training and inference algorithms to the rebuttal PDF. We will attempt to clarify this further in the manuscript.
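The training-time sampling described in this answer can be sketched as follows. This is an illustrative 1D toy, not the authors' eq. (15) or their rebuttal pseudocode: the interpolation form, the constants `bump` and `scale` (stand-ins for learned network outputs), and `SIGMA_MIN` are all assumptions.

```python
import math
import random

T = 1.0
A, B = -1.0, 1.0        # hypothetical 1D start and end states
SIGMA_MIN = 1e-3        # assumed small floor so the std never collapses to zero

def mean(t, bump=0.5):
    # Linear interpolation between A and B plus a correction that vanishes
    # at t=0 and t=T, so the boundary conditions hold by design.
    s = t / T
    return (1 - s) * A + s * B + s * (1 - s) * bump

def std(t, scale=0.3):
    # Spread is smallest at the endpoints and largest mid-path.
    s = t / T
    return s * (1 - s) * scale + SIGMA_MIN

def sample_marginal(t, rng):
    # Reparameterized draw x_t = mu_t + sigma_t * eps; no SDE is integrated.
    return mean(t) + std(t) * rng.gauss(0.0, 1.0)

rng = random.Random(0)
batch = []
for _ in range(4):
    t = rng.uniform(0.0, T)                      # random time in [0, T]
    batch.append((t, sample_marginal(t, rng)))   # independent marginal sample
```

Each element of `batch` is an independent draw from a single-time marginal, which is all a marginal-based objective needs during training. Producing a continuous path is a separate step that uses the SDE associated with these marginals, as the answer above describes.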

Q5: Evaluating max energy and the likelihood might not measure the accuracy of the estimated distribution of the conditioned SDE in (6).

Indeed, the likelihood on its own does not capture the distribution of sampled paths. However, transition path sampling is an extremely high-dimensional problem (D=66*2*1000), and thus it is difficult to characterize the accuracy of matching the full distribution. We use conventional metrics, such as describing paths by the transition state (point with the highest energy) [2,3], or estimating the likelihood of trajectories [4].

Q6: It is crucial to study the impact of inexactly solving the variational problems (9) or (11).

In this paper, we learn the target path-measure $P^*$, which corresponds to Doob’s h-transform. Corollary 2 shows that the optimized objective corresponds to the KL-divergence between the parameterization and the reference measure, $D_{\text{KL}}(Q:P^{\text{ref}})$. Using the Pythagorean relation (see, e.g., Theorem 3.3 in [5]), one can show $D_{\text{KL}}(Q:P^{\text{ref}})=D_{\text{KL}}(Q:P^*)+D_{\text{KL}}(P^*:P^{\text{ref}}),$ where the last term is a constant. Thus, minimizing the objective is equivalent to minimizing the KL-divergence between the parameterization and the target measure. Thank you for your suggestion; we will add the corresponding discussion in the final version.
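The decomposition can be checked numerically on a finite-state analogue. The sketch below is a toy with made-up probabilities, not the path-space statement of Theorem 3.3 in [5]: the "paths" are four discrete outcomes, the target is the reference conditioned on a rare event, and the candidate `q` is supported on that event.

```python
import math

def kl(p, q):
    """KL divergence D_KL(p : q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical reference measure over four outcomes; the last two hit the rare event.
p_ref = [0.60, 0.30, 0.07, 0.03]
event = [False, False, True, True]

# Target: the reference conditioned on the event (discrete stand-in for the h-transform).
mass = sum(p for p, hit in zip(p_ref, event) if hit)
p_star = [p / mass if hit else 0.0 for p, hit in zip(p_ref, event)]

# A variational candidate that satisfies the conditioning by construction.
q = [0.0, 0.0, 0.5, 0.5]

lhs = kl(q, p_ref)
rhs = kl(q, p_star) + kl(p_star, p_ref)
```

Here `lhs` and `rhs` agree, and the constant term `kl(p_star, p_ref)` equals `-log(mass)`, the negative log-probability of the rare event under the reference. Lowering `kl(q, p_ref)` over candidates supported on the event therefore lowers `kl(q, p_star)` by the same amount.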

Q7: Is the Optimization problem in (9) well-posed? This means, is the optimal solution unique?

As stated in Theorem 1, the optimization problem in (9) has a unique solution (see proof in Appendix A.2).

Q8: In several instances, there are missing commas in the notation for the inner product (e.g., equations (9b) and (12)).

Both equations (9b) and (12) contain the divergence operator rather than an inner product. The notation for the divergence operator is $\langle\nabla_x, \cdot\rangle = \text{div}(\cdot)$, which is introduced in line 98.

Q9: Missing mathematical details and assumptions. Most importantly, what are the assumptions on the drift vector field b_t to ensure the well-posedness of the proposed scheme?

For the mathematical details and assumptions, we refer the reader to [6], which rigorously defines the necessary conditions for Doob’s h-transform and the corresponding PDEs. The necessary assumptions for our result are stated at the beginning of Appendix A.2.

Q10: Instead of introducing \xi_{min} I suppose that one could also directly work with a pseudoinverse when defining G_t^-1.

We will discuss this option in the final version of the paper.

Q11: In Proposition 4, should it be u_t on the right hand side of the equation for u?

Yes, we will correct it.

Q12: Justification of the “inefficiency” of importance sampling is missing.

For example, the recent work [7], which relies on importance sampling, requires 120M energy evaluations to output a reasonable transition path.

Q13: The approach is limited to conditioning the sample path on point sets x_T = B.

In transition path sampling, the rare event is usually represented by the point $B$. If several rare events are given, we can run our method several times using different values of $B$. Conditioning on other sets is a direction for future study.

Q14: When referring to probability density functions, it's important to use proper notation. Instead of \rho(x_t = x), it would be clearer to use \rho_t(x).

We will clarify the notation.

Comment - Thanks for the response

2024-08-10

Thank you very much for the detailed response.

I might be misunderstanding the entire simulation-free approach, but don’t you sample independently at each time $t$ from the marginal distributions to generate paths of the conditioned SDEs? (Please correct me when I am misunderstanding something) While this sampling scheme might be consistent under certain assumptions on the drift and diffusion coefficients, it certainly isn’t universally applicable. Generally, how do you ensure the continuity of the sampled paths? This concern remains relevant regardless of whether you employ Gaussian or Gaussian mixture parameterizations.

I would be less critical of my concern if there were clear empirical experiments to support the correct sampling distribution. In particular, the synthetic data example presents an opportunity to conduct a statistically robust case study.

Additionally, I remain concerned about the lack of explicit mathematical assumptions and the challenges associated with conditioning on point sets. Doesn’t it impose technical challenges when applying Doob's h-transform for conditioning on null sets? For example, the application of Jamison (1975) requires a positive function h (as also noted in line 514).

While I find the proposed framework very promising, I am maintaining my score due to the concerns I've outlined.

Comment - Clarifying Misunderstanding of Test vs. Train Sampling

2024-08-11

Testing vs. Training

I might be misunderstanding the entire simulation-free approach, but don’t you sample independently at each time $t$ from the marginal distributions to generate paths of the conditioned SDEs?

Simulation-free refers to our training method, where our objective in Thm 1 only requires samples from the time-marginals $q_{t|0,1}$ of the conditioned SDE. This justifies neglecting full trajectory information and sampling directly from $q_{t|0,1}$ (without SDE simulation) during training.

See Algorithm 1 of the general-response PDF or cell 20 of the anonymized code.
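The distinction can be illustrated with a minimal sketch of what a simulation-free training batch might look like. It is not the authors' Algorithm 1: the function `sample_marginal`, the linear mean interpolation, and the `sigma_min`/`sigma_max` schedule are hypothetical stand-ins for the learned Gaussian marginals $q_{t|0,1}$. The only point is that each sample is drawn directly at its own time, with no SDE solver involved.

```python
import numpy as np

def sample_marginal(rng, A, B, t, T=1.0, sigma_min=0.05, sigma_max=0.5):
    """Draw x_t ~ N(mean_t, std_t^2 I) directly, without simulating an SDE.

    Hypothetical parameterization (assumption, not from the paper):
    the mean moves linearly from A to B, and the std equals sigma_min
    at both endpoints and is larger in between.
    """
    s = t / T
    mean = (1.0 - s) * A + s * B
    std = sigma_min + sigma_max * s * (1.0 - s)
    return mean + std * rng.standard_normal(A.shape)

rng = np.random.default_rng(0)
A = np.array([-1.0, 0.0])  # start state (illustrative)
B = np.array([1.0, 0.0])   # target state (illustrative)

# One training batch: every element gets its own time and its own sample.
times = rng.uniform(0.0, 1.0, size=128)
batch = np.stack([sample_marginal(rng, A, B, t) for t in times])
```

In this sketch the boundary conditions hold by construction, since the mean is exactly `A` at `t = 0` and exactly `B` at `t = T`. A learned model would replace the fixed interpolation.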

… don’t you sample independently at each time $t$ from the marginal distributions to generate paths of the conditioned SDEs? Generally, how do you ensure the continuity of the sampled paths?

We do not sample each marginal independently to generate paths of the conditioned SDE. Instead, our Algorithm 2 in the general-response PDF outlines our sampling approach.

For a given set of time marginals $q_{t|0,1}$, we can calculate the appropriate drift $u_{t|0,1}$ for an SDE with given diffusion coefficients. At generation (test) time, we simulate this SDE, $dx_{t|0,1} = u_{t|0,1}(x_{t|0,1}) dt + \Xi_t dW_t$, starting from $x_0 \sim \mathcal{N}(A, \sigma_{\text{min}} \mathbb{I})$. Thus, our sampled paths are continuous (up to the SDE solver error).

Note that $u_{t|0,1}$ implies a control drift term $v_{t|0,1}$ (in Eq. 10) as in Eq. 14, but we do not need to evaluate the costly drift term $b_t$ at generation time.

See cells 24-25 of anonymized code.
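As a rough illustration of generation by SDE simulation, the sketch below integrates a pinned diffusion with Euler-Maruyama. It does not reproduce Algorithm 2 or the anonymized code: the drift `bridge_drift` is the textbook Brownian bridge drift, used here as a hypothetical stand-in for the learned $u_{t|0,1}$, and the scalar `sigma` stands in for $\Xi_t$. All names and values are assumptions.

```python
import numpy as np

def bridge_drift(x, t, B, T=1.0):
    # Brownian bridge drift pinned at B at time T (stand-in for u_{t|0,1}).
    return (B - x) / (T - t)

def euler_maruyama(rng, x0, B, sigma=0.3, T=1.0, n_steps=1000):
    """Simulate dx = u(x, t) dt + sigma dW on a uniform time grid."""
    dt = T / n_steps
    path = np.empty((n_steps + 1,) + x0.shape)
    path[0] = x0
    for k in range(n_steps):
        t = k * dt  # t stays strictly below T, so the drift is finite
        noise = rng.standard_normal(x0.shape)
        path[k + 1] = (path[k]
                       + bridge_drift(path[k], t, B, T) * dt
                       + sigma * np.sqrt(dt) * noise)
    return path

rng = np.random.default_rng(0)
A = np.array([-1.0, 0.0])
B = np.array([1.0, 0.0])
x0 = A + 0.01 * rng.standard_normal(2)  # small Gaussian perturbation of A
path = euler_maruyama(rng, x0, B)
```

Because every state is obtained from the previous one by a small increment, the path is continuous up to the step size, which is the property the reviewer asked about. Sampling each time marginal independently would not give this.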

(Please correct me when I am misunderstanding something)

Thank you for pinpointing this misunderstanding! We hope that the above points help to address these concerns.

Based on your feedback, we agree that the following changes will greatly improve the final manuscript:

Discussion of the sampling procedure (and Algorithm box) will be included in the main text.
Distinctions between the time marginals $q_{t|0,1}$ of the conditioned process and the full path measure or SDE will be carefully emphasized throughout.
We will visualize individual trajectories in Figure 2 to show that the paths are continuous and similar to MCMC trajectories.
We will modify line 47 to “our training method is simulation-free”. It appears that all other usage of “simulation-free” explicitly refers to training or optimization, rather than generation.

Evaluating Sampling Trajectories

I would be less critical of my concern if there were clear empirical experiments to support the correct sampling distribution.

Thank you for this suggestion. For the Müller-Brown experiment in Figure 2, we report Wasserstein-1 distances between (i) samples from expensive MCMC simulation (which we treat as ground-truth) and (ii) sample trajectories generated using Algorithm 2 with our learned model. We report summary statistics of the W1 distance across discrete time steps $t$:

Wasserstein W1	Value
Mean	0.1251
Std	0.0392
Median	0.1130
Min	0.0393
Max	0.2115

We will report further numbers and plots in the revised version of the manuscript.
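The rebuttal does not say which W1 estimator was used, so the following is only a sketch of how such per-time statistics could be assembled. It assumes a single coordinate and equal sample sizes, in which case the empirical W1 distance reduces to the mean gap between sorted samples. The helper names `w1_1d` and `w1_summary` are hypothetical, and the Müller-Brown system itself is two-dimensional, so a multivariate optimal-transport solver would be needed for a faithful reproduction.

```python
import numpy as np

def w1_1d(a, b):
    """W1 between two equal-size 1D empirical samples: mean gap of sorted values."""
    a, b = np.sort(a), np.sort(b)
    return np.mean(np.abs(a - b))

def w1_summary(ref_paths, gen_paths):
    """Summarize per-time W1 distances.

    ref_paths, gen_paths: arrays of shape (n_paths, n_times) holding one
    coordinate of the reference (e.g. MCMC) and generated trajectories.
    """
    per_t = np.array([w1_1d(ref_paths[:, k], gen_paths[:, k])
                      for k in range(ref_paths.shape[1])])
    return {
        "mean": per_t.mean(),
        "std": per_t.std(),
        "median": np.median(per_t),
        "min": per_t.min(),
        "max": per_t.max(),
    }
```

The returned keys mirror the rows of the table above. No numbers are reproduced here because the underlying trajectories are not part of this thread.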

Conditioning on Dirac Deltas

…challenges associated with conditioning on point sets. Doesn’t it impose technical challenges when applying Doob's h-transform for conditioning on null sets?

When conditioning on a point mass $\delta(x-B)$, the h-function becomes a density, i.e., $h(y,t) := \rho(x_T = B | x_t = y)$ is the density of the transition probability $\mathbb{P}(x_T \in dx| x_t = y)$. Conditioning on point sets is commonly used [1 Thm 7.11,2,3], and indeed, all the derivations in Appendix A hold if we take $h(y,t):=\rho(x_T = B | x_t = y)$. We apologize for the confusion; we will clarify this in the next version.
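A standard textbook case may make the density-valued h-function concrete. The worked example below is not taken from the paper: it assumes a reference process with zero drift and a scalar noise level $\sigma$ (i.e., $dx_t = \sigma\, dW_t$), for which the transition density is Gaussian and the conditioned process is the Brownian bridge.

```latex
% Assumption: reference process dx_t = \sigma\, dW_t (zero drift, scalar \sigma)
h(y,t) = \rho(x_T = B \mid x_t = y)
       = \mathcal{N}\!\left(B;\; y,\; \sigma^2 (T-t)\,\mathbb{I}\right),
\qquad
\nabla_y \log h(y,t) = \frac{B - y}{\sigma^2 (T-t)},
\\[4pt]
dx_t = \sigma^2\, \nabla_x \log h(x_t,t)\, dt + \sigma\, dW_t
     = \frac{B - x_t}{T - t}\, dt + \sigma\, dW_t .
```

Here $h$ is strictly positive for every $t < T$ even though the event $\{x_T = B\}$ has probability zero, which is why replacing the event probability by the transition density keeps the transform well defined in this example.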

[1] Särkkä, Simo, and Arno Solin. Applied stochastic differential equations. Vol. 10. Cambridge University Press, 2019.

[2] Heng, J., De Bortoli, V., Doucet, A. and Thornton, J., 2021. Simulating diffusion bridges with score matching. arXiv preprint arXiv:2111.07243.

[3] Liu, Xingchao, Lemeng Wu, Mao Ye, and Qiang Liu. "Let us build bridges: Understanding and extending diffusion generative models." arXiv preprint arXiv:2208.14699 (2022).

Comment - Raising the score

2024-08-12

Thank you for your prompt response and thorough clarification of my concerns. I am happy to raise my score to 6. I did not give a higher score because I believe further effort is needed to enhance the presentation of the proposed framework by incorporating the points discussed in the rebuttal.

Review

Rating: 7, Confidence: 3

2024-07-13

This paper proposes a variational formulation of Doob's h-transform that replaces expensive trajectory simulation with an optimization problem over possible trajectories between a given initial point and end point. The model parameterization imposes the desired boundary conditions by design and uses a mixture of Gaussians as the variational family. The variational formulation offers an alternative to expensive simulations and to MCMC, which is hard to scale. The proposed framework is applied to a simulated case study and to the real-world problem of transition path sampling in materials science and computational chemistry, which studies how molecules transition between local energy minima and metastable states under random fluctuations. The results are as expected: the method scales to problems that are computationally intractable for MCMC and other sampling-based methods.

Strengths

The paper is well written, with sound math and many references to both recent work and older literature on the subject. I found the introduction to be a great summary of the work.
The idea is sound. Using a variational framework to solve the complex problem of estimating rare event probabilities by sampling forward trajectories brings out the desired qualities: sample efficiency, a reduced search space, and matching a complex target distribution with an easier-to-sample variational distribution. As Tables 1 and 2 also show, MCMC cannot be scaled to certain large-scale experiments where variational inference can give good results.
The work is applied to the transition path sampling problem, which appears to be important in materials science and chemistry (I am not well acquainted with those domains). The case study on the protein is well documented and explained.
The figures and illustrations are clean and support the narrative.

Weaknesses

Minor things

Increase the font size for Table 1 and bold the important results.
I would have liked to see a bit more discussion on the dimensionality: what are the typical values of D?

Questions

I am curious whether, similar to Figure 3, the authors can show an experiment with a family of distributions other than Gaussian. How does the expressivity then come into play?
As for any SDE problem solution, what is the effect of the discretization values and parameters in practice, as given in lines 184-185?
How restrictive is the constraint of a fixed length (T) on transition paths in practice, as solved in this method?
What is the behaviour of the problem when there are multiple basins of attraction?

Limitations

The authors have addressed the limitations well and outlined possible directions for future research.

Author Response

2024-08-07

Q1: I would have liked to see a bit more discussion on the dimensionality: what are the typical values of D?

Thank you for your valuable feedback. We will discuss this in the next revision of the manuscript. The dimension $D$ depends on the specific system being modeled. For instance, alanine dipeptide with its $22$ atoms results in $D = 22 \times 3 = 66$. If modeled with second-order dynamics, $D$ doubles to $132$. Chignolin has $166$ atoms and results in $D = 996$ in our experiments. Large molecular systems, such as proteins with explicit solvent environments, can exceed $10,000$ atoms, with some even reaching $100,000$ atoms. As such, TPS approaches must accommodate these high-dimensional systems.
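The arithmetic behind these figures can be checked in a few lines. The helper `state_dim` is hypothetical, and reading $D = 996$ for chignolin as the second-order case ($166 \times 3 \times 2$) is an inference from the numbers quoted above rather than something the response states explicitly.

```python
def state_dim(n_atoms, second_order=False):
    """State dimension for a molecule with n_atoms atoms.

    Three Cartesian coordinates per atom; second-order dynamics also
    track one velocity per coordinate, which doubles the dimension.
    """
    d = 3 * n_atoms
    return 2 * d if second_order else d

print(state_dim(22))                      # 66  (alanine dipeptide)
print(state_dim(22, second_order=True))   # 132
print(state_dim(166, second_order=True))  # 996 (chignolin, assumed second-order)
```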

Q2: I am curious whether, similar to Figure 3, the authors can show an experiment with a family of distributions other than Gaussian. How does the expressivity then come into play?

Note that the crucial restriction on the family of marginals is that the vector field satisfying the Fokker-Planck equation must be available in analytic form, in order to avoid the min-max formulation in Corollary 1. We conducted experiments using Corollary 1 in toy settings but found it unstable when scaled to real-world systems.

Q3: As for any SDE problem solution, what is the effect of the discretization values and parameters in practice, as given in lines 184-185?

We do not discretize time during training; instead, the time variable is drawn uniformly from $[0,T]$ in order to estimate the time integral. We will clarify this in the final version of the manuscript. These training details can also be seen in the training algorithm uploaded in the rebuttal PDF.
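Drawing times uniformly amounts to a Monte Carlo estimate of the time integral, which can be sketched as follows. The integrand `t**2` and the helper `mc_time_integral` are illustrative assumptions, not the paper's loss. In training, the integrand would be the per-time objective evaluated at samples from the marginals.

```python
import numpy as np

def mc_time_integral(f, T, n_samples, rng):
    """Estimate the integral of f over [0, T] as T * mean(f(t)), t ~ Uniform[0, T]."""
    t = rng.uniform(0.0, T, size=n_samples)
    return T * np.mean(f(t))

rng = np.random.default_rng(0)
estimate = mc_time_integral(lambda t: t**2, T=2.0, n_samples=200_000, rng=rng)
exact = 2.0**3 / 3.0  # closed form of the integral of t^2 over [0, 2]
```

The estimator is unbiased for any sample size, so no time grid has to be fixed in advance. Discretization only enters later, when the SDE is simulated at generation time.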

Q4: How restrictive is the constraint of a fixed length (T) on transition paths in practice, as solved in this method?

In general, there are ways to estimate the transition time $T$ [1,2,3], and this problem is much easier than sampling the path itself. However, we consider fixed trajectory length to be one of the limitations of our algorithm and a direction of future work.

Q5: What is the behaviour of the problem when there are multiple basins of attraction?

Our algorithm is designed to solve the concrete problem statement that trajectories end in a particular state $B$. However, due to the stochastic nature of the dynamics, there are multiple pathways to reach $B$. The most vivid example is presented in Figure 3, where we consider a symmetric potential and the sampled paths follow two different routes. Similarly, in the Müller-Brown experiment, despite local minima attracting points, transitions still end in the target state. This illustrates that the algorithm can handle multiple basins of attraction by sampling diverse paths that all converge to the designated end state.

Q6: Increase the font size for Table 1 and bold the important results.

We will improve the formatting of the tables and the presentation of our results in the revised manuscript.

References

[1] G. H. Taumoefolau and R. B. Best, 2021, “Estimating transition path times and shapes from single-molecule photon trajectories: A simulation analysis,” The Journal of Chemical Physics, vol. 154, no. 11. AIP Publishing.

[2] H. S. Chung, J. M. Louis, and W. A. Eaton, 2009, “Experimental determination of upper bound for transition path times in protein folding from single-molecule photon-by-photon trajectories,” Proceedings of the National Academy of Sciences, vol. 106, no. 29. Proceedings of the National Academy of Sciences, pp. 11837–11844.

[3] E. Suárez, J. L. Adelman, and D. M. Zuckerman, 2016, “Accurate Estimation of Protein Folding and Unfolding Times: Beyond Markov State Models,” Journal of Chemical Theory and Computation, vol. 12, no. 8. American Chemical Society (ACS), pp. 3473–3481.

Author Response

2024-08-07

We would like to thank all the reviewers for their diligent review and valuable comments, which helped us to improve the manuscript. In this general response, we would like to address shared concerns raised by more than one reviewer.

To make the implementation and training procedure for our method clearer, we have included pseudocode for the training and inference processes in the revised manuscript. We hope that this addition makes the approach more accessible and easier to follow. We have also uploaded this information as a PDF in the rebuttal.

Further, some reviewers expressed interest in the training behavior (e.g., loss, training curves) of our model. To address this, we have included additional plots showcasing the model's performance in the manuscript and have uploaded these as a PDF in the rebuttal. The figures provided are:

Loss vs training iterations.
Transition states (i.e., maximum energy) of newly sampled paths vs number of energy evaluations.

We are providing individual responses to the questions raised by each reviewer.

Final Decision: Accept (spotlight)

2024-09-25

This is a strong manuscript that proposes a novel variational technique for Monte Carlo sampling of random dynamical systems which is justified theoretically as well as experimentally. The reviewers are unanimous in its technical merit both conceptually and in practice. After the authors make a few improvements to the manuscript in terms of presentation and references, it will be an excellent addition to the NeurIPS conference program.