PaperHub
ICLR 2025
Average rating: 5.2 / 10 · Decision: Rejected · 5 reviewers
Ratings: 5, 3, 8, 5, 5 (min 3, max 8, std 1.6)
Confidence: 4.2 · Correctness: 2.2 · Contribution: 2.0 · Presentation: 3.0

MF-LAL: Drug Compound Generation Using Multi-Fidelity Latent Space Active Learning

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords
drug discovery · multi-fidelity learning · generative models

Reviews and Discussion

Review
Rating: 5

This paper proposes a multi-fidelity method that integrates generative modeling of molecules with surrogate modeling of the oracles, and aims at a trade-off between a docking oracle that is of low fidelity but computationally cheap, and a binding free energy oracle that is of high fidelity but computationally expensive. Besides, the authors propose a new criterion during active learning for generating candidate query molecules for the oracles.

Strengths

  • A new criterion for candidate molecule generation during active learning is proposed.
  • The authors implement diverse baseline methods for comparison with the proposed one and consider multiple sources of oracles.
  • Implementation code is provided to support reproducibility.

Weaknesses

  • The soundness and novelty of the main feature of MF-LAL - "using separate latent spaces and decoders specialized for each fidelity improves surrogate modeling and inter-fidelity information passing" - is questionable. My detailed concerns are as follows:

    • The authors use a set of random variables $z_{1:K}$, corresponding to the $K$ fidelities, as the latent representation of molecule $x$. According to the main text, the variational distribution $q(z_{1:K}|x)$ (which approximates the true posterior $p(z_{1:K}|x)$) is defined by $q(z_k|z_{k-1})=\mathcal{N}(\mu_k(z_{k-1}),\sigma_k(z_{k-1}))$, where $z_0=x$ and $k=1,\dots,K$. By enhancing the expressive capability of $\mu_k$ and $\sigma_k$, choosing them to be the outputs of neural networks $h_k$, the authors expect a better approximation of $p(z_{1:K}|x)$, and thereby the claimed "better inter-fidelity information passing".
      • Defining $F_k(\cdot)=\mu_k(\cdot)+\sqrt{\sigma_k(\cdot)}\,\epsilon_k$ with $\epsilon_k\sim\mathcal{N}(0,1)$ and $F'(F(\cdot))=(F'\circ F)(\cdot)$, the $z_k$ in this paper is represented by $z_k=\mu_k((F_{k-1}\circ\cdots\circ F_1)(x))+\sqrt{\sigma_k((F_{k-1}\circ\cdots\circ F_1)(x))}\,\epsilon_k$.
    • However, we can alternatively represent $z_k$ as $z_k=\mu_k((h_{k-1}\circ\cdots\circ h_1)(x))+\sqrt{\sigma_k((h_{k-1}\circ\cdots\circ h_1)(x))}\,\epsilon_k$ (where $h'\circ h$ means stacking the two neural networks). This way, I believe, also enhances the expressive capability of $\mu_k$ and $\sigma_k$.
      • Since $\epsilon_k$ is non-informative, the issue here is that the method considered in this paper and the method I provide above are conceptually similar, and the latter amounts to nothing more than adding neural network layers to $\mu_k$ and $\sigma_k$ as $k$ increases, rendering the novelty of the proposed method trivial and the claimed "better message passing" questionable.
  • Following the setup of $q(z_k|x)$, the authors propose Algorithm 2 for efficient generation of candidate molecules that are expected to score highly at each fidelity. However, if we follow the conceptually similar setup I described above, Algorithm 2 just reduces to drawing a sample batch $\{\hat{\epsilon}^{(i)}\}$ from $\mathcal{N}(0,1)$ and keeping only the samples $\hat{\epsilon}$ such that $z_k(\hat{\epsilon})$ scores highly for each $k$.

  • The experiment results are not convincing. The detailed reasons are as follows:

    • As acknowledged by the author, the goal of multi-fidelity methods is to balance the trade-off between fidelity and computation cost under a fixed computation budget (lines 69-70, algorithm 1). In this paper, the first oracle, docking activity, is computationally inexpensive but has low fidelity, whereas the second oracle, binding free energy, is computationally expensive but offers higher fidelity.
    • In this sense, we would like to see a training curve as done in the baseline method [1], where the $y$-axis represents the values of the performance metrics and the $x$-axis represents the computational cost until the budget is exceeded.
    • However, the author only provides Table 1 in this paper, which summarizes the final performance metric values. This alone is insufficient to support the claimed capability for the trade-off.
  • The other claimed feature - "MF-LAL combines the generative and multi-fidelity surrogate models into a single framework" (lines 20-21) - is questionable.

    • While the authors claim that the workflow in [1] is separate (lines 56-58), we note that the integrated workflow of MF-LAL is similar to [1]. The integrated part that I found in MF-LAL is that in equation (1), the first term (corresponding to the generative part) and the second term (corresponding to the surrogate part) are optimized jointly.
    • However, since the two terms are sequential, with the output of the generative part being the input of the surrogate part, there is no big difference between optimizing them sequentially as described in [1] or jointly as in MF-LAL.
    • By the way, GPs are non-parametric models [3], and $\lambda_k$ in line 205 denotes the hyper-parameters of the mean or covariance function. Thus, the second term does not exist when there are no hyper-parameters or when the hyper-parameters are fixed. Surrogate modeling of the oracle commonly refers to performing posterior inference of $P(f|x,\mathcal{D})$, where $\mathcal{D}$ is the multi-fidelity dataset, rather than fitting the hyper-parameters [4].
  • Equation (1) is incorrect. The ELBO (evidence lower bound) is a lower bound on the log-likelihood of the observations, i.e., $\log p(x)$. So, for example, the first term for $k=1$ should be $\log p(x)=\log E_{z_1}\left[\frac{P(x|z_1)P(z_1)}{q(z_1|x)}\right]\geq E_{z_1}\left[\log \frac{P(x|z_1)P(z_1)}{q(z_1|x)}\right]$ rather than $E_{z_1}\left[\log \frac{P(x|z_1)}{q(z_1|x)}\right]$. Besides, if $f_k$ in the second term is a GP (Gaussian process), what is the definition of $p(y|f_k)$? (Is $y$ equal to the output of a GP plus additive Gaussian noise?)

  • According to Algorithm 1, the hyper-parameter $\gamma_k$ is crucial as it dictates whether queries are made at the next level of fidelity. Therefore, an ablation study or a discussion of how the $\gamma_k$ are selected should be provided.

[1] Alex Hernandez-Garcia, Nikita Saxena, Moksh Jain, Cheng-Hao Liu, and Yoshua Bengio. Multifidelity active learning with gflownets. arXiv preprint arXiv:2306.11715, 2023.

[2] Kirthevasan Kandasamy, Gautam Dasarathy, Barnabas Poczos, and Jeff Schneider. The multifidelity multi-armed bandit. Advances in neural information processing systems, 29, 2016.

[3] Matthias Seeger. Gaussian processes for machine learning. International Journal of Neural Systems, 14(2):69-106, 2004.

[4] Peter I. Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.

Questions

  • The abstract can be improved. There is a lack of information about what is replaced by a surrogate model (the oracles?) and why a surrogate model is used.

  • There is no definition or intuitive explanation of what reverse optimization is. This term seems to be important in the main text.

  • In Algorithm 2, the $z_k^{(i)}$ are samples of random variables. How can samples be further optimized by gradient descent?

Comment

We thank the reviewer for their useful feedback.

The soundness and novelty of the main feature of MF-LAL - "using separate latent spaces and decoders specialized for each fidelity improves surrogate modeling and inter-fidelity information passing" - is questionable … the method considered in this paper, and the method I provide above are conceptually similar, and the latter one means nothing but adding more neural network layers to $\mu_k$ and $\sigma_k$ as $k$ increases, rendering the novelty of the proposed method trivial, and the claimed "better message passing" questionable.

In the case where we only sample from each latent space $z_1, \dots, z_{K-1}$ to predict the mean and variance in the subsequent latent space, then the process can be simplified by using a single network to predict the mean and variance at the final latent space and then adding noise. However, the surrogate models $\hat{f}_1, \dots, \hat{f}_K$ are the key components of our architecture that make sampling from each latent space separately necessary. As the surrogate model for each fidelity takes points $z_k^{(i)}$ in the associated latent space as input, it is necessary that we define each of these latent spaces separately and can sample from each individually. Additionally, the information passing between latent spaces (using $h_{\xi_k}$) is possible because each surrogate model $\hat{f}_k$ organizes its associated latent space $z_k$ to best aid in prediction at that fidelity level. Thus, it is necessary to retain each individual latent space and have separate neural networks to pass information between the latent spaces, or the benefits of the separated surrogate models cannot be realized.

...if we follow the conceptually similar setup I described above, algorithm 2 just reduces to getting a sample batch $\{\hat{\epsilon}^{(i)}\}$ from $\mathcal{N}(0,1)$, and only keeping the samples $\hat{\epsilon}$ such that $z_k(\hat{\epsilon})$ is of high scores for each $k$.

The key difference between the reviewer’s approach and our proposed approach is that we perform optimization of compound scores at each fidelity level. This means we use a gradient-based optimizer to find locations in each latent space that have high surrogate-predicted scores. In contrast, the reviewer’s approach consists of sampling points in latent space zkz_k and checking each one to see if it has high scores at all other fidelities. This sampling approach will not generate compounds with as good scores as the optimization-based technique. This is because the optimization-based technique takes advantage of knowing the gradient of the surrogate model, as opposed to the sampling technique that randomly guesses until it finds a good compound.
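
For concreteness, a minimal sketch of this optimization-based step (all names here are hypothetical rather than the authors' code; the UCB-plus-L2 acquisition form is an assumption based on the discussion elsewhere in this thread):

```python
import torch

def reverse_optimize(surrogate, z_init, steps=200, lr=1e-2, beta=1.0, l2=0.1):
    # Gradient ascent on the surrogate-predicted score, starting from a random
    # latent point; `surrogate` must be differentiable w.r.t. z and return a
    # predictive (mean, std) pair, as an SVGP head would.
    z = z_init.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        mean, std = surrogate(z)
        # UCB-style acquisition with an L2 penalty to stay in-distribution.
        acq = mean + beta * std - l2 * z.pow(2).sum()
        (-acq).backward()  # minimize the negative acquisition
        opt.step()
    return z.detach()

# Usage sketch: z0 = torch.randn(latent_dim); z_star = reverse_optimize(f_hat_k, z0)
```

In contrast, the sampling scheme the reviewer describes never uses the surrogate's gradient; it only evaluates the surrogate at randomly drawn points.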

we would like to see a training curve as done in the baseline method [1], where the $y$-axis represents the values of performance metrics, and the $x$-axis represents the computation cost until the budget is exceeded … However, the author only provides Table 1 in this paper, which summarizes the final performance metric values. This alone is insufficient to support the claimed capability for the trade-off.

We would like to emphasize Appendix C.1, which contains a training curve similar to the reviewer’s description. The plot shows the oracle-predicted scores of the generated compounds during the active learning process, and as expected the scores become more favorable as active learning proceeds. We would be happy to provide additional data if the reviewer has specific requests.

While the authors claim that the workflow in [1] is separate (lines 56-58), we note that the integrated workflow of MF-LAL is similar to [1]. The integrated part that I found in MF-LAL is that in equation (1), the first term (corresponding to the generative part) and the second term (corresponding to the surrogate part) are optimized jointly. However, since the two terms are sequential, with the output of the generative part being the input of the surrogate part, there is no big difference between optimizing them sequentially as described in [1] or jointly as in MF-LAL.

The generative model in [1] does not explicitly model the different fidelity levels. Instead, it is a standard generative model that optimizes the acquisition function of a separate multi-fidelity surrogate. The generative model only optimizes a single scalar acquisition value, and is thus not aware of the multi-fidelity nature of the generation problem. In contrast, our model is integrated in that it explicitly models each fidelity inside of the generative model. This means the generative model is aware of the differences between fidelities, and is able to generate compounds tailored to each specific fidelity level.

(response continues below)

Comment

GPs are non-parametric models [3], and $\lambda_k$ in line 205 is the hyper-parameters of the mean or covariance function. Thus, the second term does not exist when there are no hyper-parameters or hyper-parameters are fixed. The surrogate modeling of the oracle is commonly referred to as doing posterior inference … rather than fitting the hyper-parameters [4].

We have updated the text to clarify that $\lambda_k$ refers to the GP hyperparameters. While "performing posterior inference" may be more accurate in the case of vanilla GPs, we use SVGPs, so we believe it is more accurate to say we are "fitting the hyperparameters" (as we use an optimizer to learn these hyperparameters along with the rest of MF-LAL's parameters).

Equation (1) is incorrect…

We thank the reviewer for pointing out the typo. The first term is indeed not the ELBO, but instead a term whose maximization is equivalent to maximizing the ELBO. We have updated the draft to clarify this. In the second term, $y$ refers to the ground-truth oracle output, not the GP output.

According to Algorithm 1, the hyper-parameter $\gamma_k$ is crucial as it dictates whether queries are made at the next level of fidelity. Therefore, an ablation study or discussion for how $\gamma_k$ are selected should be provided.

$\gamma_k$ is chosen along with the other hyperparameters using a hyperparameter search (described in Appendix B). We found that $\gamma_k = 0.1$ for all $k$ worked well, chosen from the range $[0, 1]$ during the hyperparameter search. We have added text to the updated draft to clarify this.
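
As a rough illustration of the role $\gamma_k$ plays, here is a sketch of one plausible reading of the promotion step in Algorithm 1 (hypothetical names; the paper's exact criterion may differ):

```python
def maybe_query_next_fidelity(z, surrogate_k, oracle_next, gamma_k, beta=1.0):
    # Promote a candidate to the next (more expensive) oracle only when the
    # current level's acquisition value clears the threshold gamma_k.
    mean, std = surrogate_k(z)
    acquisition = mean + beta * std  # UCB at fidelity k
    if acquisition > gamma_k:
        return oracle_next(z)  # spend budget at fidelity k+1
    return None  # candidate stays at the cheaper fidelity
```

A larger $\gamma_k$ makes promotion rarer and shifts budget toward the cheap oracles, which is why the reviewer's request for an ablation is natural.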

The abstract can be improved. There is a lack of information about "what is replaced by a surrogate model" (the oracles?) or "why a surrogate model is used".

We have updated the abstract to clarify these issues.

There is no definition or intuitive explanation about what reverse optimization is. This term seems to be important in the main text.

Reverse optimization refers to using the gradient of the surrogate model to help search for latent points that have high surrogate-predicted properties. This is done using a standard gradient-based optimizer. We have updated the draft to clarify this.

In Algorithm 2, $z_k^{(i)}$ are samples of random variables. How can samples be further optimized by gradient descent?

We intended for $z_k^{(i)} \sim \mathcal{N}(0, I)$ to mean we initialize $z_k^{(i)}$ with a random starting point and then perform optimization starting from that point. We have updated the draft to better clarify this.

Comment

Equation (1) is incorrect…

Equation (1) is still incorrect. Optimizing Eq. (1) is not equivalent to optimizing the ELBO: for $k=1$, the ELBO is equal to $E_{q(z_1|x;\theta)}\left[\log \frac{P(x|z_1)P(z_1)}{q(z_1|x;\theta)}\right]$, Eq. (1) is $E_{q(z_1|x;\theta)}\left[\log \frac{P(x|z_1)}{q(z_1|x;\theta)}\right]$, and the difference between them is $E_{q(z_1|x;\theta)}\left[\log P(z_1)\right]$. It is clear that this difference is not negligible, since it depends on the variational distribution (the neural network model).
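
Spelling out the point via Jensen's inequality (standard VAE algebra, not quoted from the paper):

$$\log p(x) = \log E_{q(z_1|x;\theta)}\!\left[\frac{P(x|z_1)P(z_1)}{q(z_1|x;\theta)}\right] \geq \underbrace{E_{q(z_1|x;\theta)}\!\left[\log \frac{P(x|z_1)}{q(z_1|x;\theta)}\right]}_{\text{term in Eq. (1)}} + \underbrace{E_{q(z_1|x;\theta)}\!\left[\log P(z_1)\right]}_{\text{prior term missing from Eq. (1)}}$$

Since the second expectation is taken under $q(z_1|x;\theta)$, it depends on the encoder parameters and cannot be dropped as a constant.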

According to Algorithm 1, the hyper-parameter $\gamma_k$ is crucial as it dictates whether queries are made at the next level of fidelity. Therefore, an ablation study or discussion of how the $\gamma_k$ are selected should be provided.

We would like to see experimental results under different settings of $\gamma_k$ (i.e., an ablation study) rather than a single reported best choice.

Comment

Thank you for your response! We would just like to respond to two of your comments:

My concern is that there seems to be no advantage to random latent spaces, since these random latent spaces are created from $\mathcal{N}(0,1)$ and do not integrate further information to help modeling. Theoretical analysis or empirical evidence should be provided to support your claims.

The latent spaces contain information to help modeling because we jointly train the encoders and surrogate models. The latent spaces become "organized" to aid in prediction at each level, which means the molecules in the latent space are ordered according to their property values. This property has already been reported in the literature [1], and is the reason the latent spaces integrate information to help in modeling.

[1] Tevosyan et al. "Improving VAE based molecular representations for compound property prediction." Journal of Cheminformatics 2022.

Therefore, as done in Figure 2 of [1], the best multi-fidelity model is the one that achieves the best performance metrics given fixed cost or the lowest cost given fixed metric thresholds.

We agree that the best way to measure the performance of multi-fidelity models is evaluating them after a fixed computational budget, which is why our primary experiments reported in Table 1 show the ABFE values of the generated compounds after expending a fixed budget. This is equivalent to taking a vertical slice of Figure 2 in [1]. We could not evaluate the performance at additional budget values due to computational cost. However, because performance at a given budget is proportional to the performance at other budgets, we still believe our results are valuable.

Comment

Thanks. Since my concerns are not addressed, I have decided to keep my current score.

Comment

Thanks for your feedback. After carefully going through your responses, I think my concerns are still not addressed. I am sorry, but I have to maintain my overall rating. The detailed reasons are as follows:

The soundness and novelty of the main feature of MF-LAL - "using separate latent spaces and decoders specialized for each fidelity improves surrogate modeling and inter-fidelity information passing" - is questionable … the method considered in this paper and the method I provide above are conceptually similar, and the latter one means nothing but adding more neural network layers to $\mu_k$ and $\sigma_k$ as $k$ increases, rendering the novelty of the proposed method trivial, and the claimed "better message passing" questionable.

The alternative way I provide also retains separate latent spaces, and actually makes individual sampling easier.

  • The two methods differ only in the following: for $z_k$, which passes through $k$ neural layers, your method adds noise after passing each layer, so noise is added $k$ times. There are $k-1$ random latent spaces and one random observed space. For $z_{k+1}$, noise is added $k+1$ times; there are $k$ random latent spaces and one random observed space. For $z_{k+2}$, and so on.

  • I just pass through all deterministic layers and add noise whenever I need to do sampling. For $z_k$, there are $k-1$ deterministic latent spaces and one random observed space. For $z_{k+1}$, there are $k$ deterministic latent spaces and one random observed space. For $z_{k+2}$, and so on.

  • Actually, maintaining deterministic spaces is more efficient and makes it easier to sample them separately: we just need to add noise to the output of the $k$-th neural layer to get a sample of $z_k$. By contrast, if we maintain random spaces, then to get a sample of $z_k$ we must first sample $z_1$, $z_2$, and so on up to $z_{k-1}$. (See the sketch below.)
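
To make the contrast concrete, a minimal PyTorch sketch of the two parameterizations being debated (hypothetical shapes and layer choices, not the paper's architecture):

```python
import torch
import torch.nn as nn

D = 64
# h[j] maps z_j -> (mu_{j+1}, log-variance_{j+1}), following the review's notation.
h = nn.ModuleList([nn.Linear(D, 2 * D) for _ in range(4)])

def z_stochastic(x, k):
    # The paper's construction as the reviewer reads it: re-sample at every
    # level, z_j = mu_j(z_{j-1}) + sigma_j(z_{j-1}) * eps_j.
    z = x
    for j in range(k):
        mu, logvar = h[j](z).chunk(2, dim=-1)
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
    return z

def z_deterministic(x, k):
    # The reviewer's alternative: stack deterministic layers and inject noise
    # once, only at the level actually being sampled.
    z = x
    for j in range(k - 1):
        z, _ = h[j](z).chunk(2, dim=-1)  # keep only the mean path
    mu, logvar = h[k - 1](z).chunk(2, dim=-1)
    return mu + (0.5 * logvar).exp() * torch.randn_like(mu)
```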

My concern is that there seems to be no advantage to random latent spaces, since these random latent spaces are created from $\mathcal{N}(0,1)$ and do not integrate further information to help modeling. Theoretical analysis or empirical evidence should be provided to support your claims.

...if we follow the conceptually similar setup I described above, algorithm 2 just reduces to getting a sample batch $\{\hat{\epsilon}^{(i)}\}$ from $\mathcal{N}(0,1)$, and only keeping the samples $\hat{\epsilon}$ such that $z_k(\hat{\epsilon})$ is of high scores for each $k$.

If your samples $z_k=\mu_k((F_{k-1}\circ\cdots\circ F_1)(x))+\sqrt{\sigma_k((F_{k-1}\circ\cdots\circ F_1)(x))}\,\epsilon_k$ can be further optimized w.r.t. the surrogates by gradient descent, then it is clear that $z_k=\mu_k((h_{k-1}\circ\cdots\circ h_1)(x))+\sqrt{\sigma_k((h_{k-1}\circ\cdots\circ h_1)(x))}\,\epsilon_k$ can also be optimized.

we would like to see a training curve as done in the baseline method [1], where the $y$-axis represents the values of performance metrics, and the $x$-axis represents the computation cost until the budget is exceeded … However, the author only provides Table 1 in this paper, which summarizes the final performance metric values. This alone is insufficient to support the claimed capability for the trade-off.

As pointed out in the baseline model you are comparing with [1], it is common sense that multi-fidelity models are designed to balance the trade-off between cost and fidelity (or accuracy). Therefore, as done in Figure 2 of [1], the best multi-fidelity model is the one that achieves the best performance metrics at a fixed cost, or the lowest cost at fixed metric thresholds. Otherwise, always choosing the high-fidelity source will simply give the best final performance. Your figure in Appendix C.1 just shows the cost-performance curves when always choosing fidelity 2 and fidelity 3, respectively, which can support the correctness of the overall active learning framework, but cannot well support your proposed multi-fidelity method.

While the author claims that workflow in [1] is separate (lines 56-58), we note that the integrated workflow of MF-LAL is similar to [1]. The integrated part, which I found in MF-LAL is that in equation (1), the first term (corresponding to generative part) and the second term (corresponding to surrogate part) are optimized jointly. However, since the two terms are sequential, where the output of generation part is the input of the surrogate part, there is no big difference whether we optimize it sequentially as described in [1] or jointly as in MF-LAL.

I respectfully disagree with your comment that 'The generative model in [1] does not explicitly model the different fidelity levels'. Please check Section 3.3, 'Multi-Fidelity GFlowNets', in [1], where a fidelity space $\mathcal{M}$ is integrated into the generative DAG and a multi-fidelity generative policy $\pi_F(x, m)$ is used.

Review
Rating: 3

This paper proposes a multi-fidelity active learning method to produce molecules with desirable properties, given multiple different oracles with varying costs. The method uses an encoder to embed molecules into a latent space, with stochastic variational Gaussian process (SVGP) models trained for each fidelity. The authors perform experiments on multiple protein targets.

Strengths

The problem studied by the authors is sensible and important to the field. The method proposed by the authors is, broadly speaking, reasonable (although I describe some weaknesses below). I think the paper is well-written and the presentation is good.

Weaknesses

I have two main criticisms of the method itself:

  1. A key motivation for the method is that it "ensures compounds generated at higher fidelities also scored well at lower fidelities, improving the quality of generated samples" (lines 075-077). This hypothesis seems questionable: why would one want this? The fundamental challenge in multi-fidelity AL is that the observations between different fidelities are not very well correlated. Is it not possible that the most interesting molecules at high fidelity score poorly at lower fidelities?
  2. The method uses SVGPs in quite a few places, which raises concerns about robustness. SVGPs are very sensitive to the location of their inducing points, and tend to overestimate variance away from their inducing points (see https://arxiv.org/abs/1606.04820). This means:
  • Line 7 in algorithm 1 may not be implemented accurately if the variance is overestimated
  • The distribution of the initial dataset will likely determine the location of the inducing points, and with the UCB objective it seems plausible that nothing away from the initial distribution will therefore be chosen. At the very least, the behavior will depend enormously on the locations of the inducing points.

Aside from this, I thought the experiments had some significant correctness issues. First and foremost, the results in Table 1 are described as "significant" despite a very small sample size of 15 and considerable variance in the scores of the generated molecules. I think the authors need to apply a statistical test to their results so that the differences between methods can be evaluated more objectively. I understand that the expensive nature of the computations requires a small sample size, but this is exactly the scenario where statistical tests are helpful.

Second, I think the authors are missing a "standard MF-AL" method as a baseline - the most "standard" baselines use VAE-GP models which may not fit the data well. I recommend either an autoregressive GP (i.e., the prior mean for fidelity $k$ is the posterior mean for fidelity $k-1$) or a "co-kriging" model where a single GP has a kernel between inputs and fidelities. EI or UCB could be used as an acquisition function, and the kernel could be either a Matérn 5/2 or a Tanimoto kernel between molecule fingerprints.
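
For reference, one standard way to build such a co-kriging baseline follows GPyTorch's Hadamard multitask pattern: a single exact GP whose kernel is the product of a Matérn 5/2 kernel over inputs and an index kernel over fidelity labels (a sketch under those assumptions, not an implementation from the paper):

```python
import gpytorch

class CoKrigingGP(gpytorch.models.ExactGP):
    # k((x, i), (x', j)) = k_x(x, x') * k_f(i, j): one GP over input/fidelity pairs,
    # where `train_fid` is a column of integer fidelity labels.
    def __init__(self, train_x, train_fid, train_y, likelihood, num_fidelities):
        super().__init__((train_x, train_fid), train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.input_kernel = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.MaternKernel(nu=2.5))  # Matern 5/2, as suggested
        self.fidelity_kernel = gpytorch.kernels.IndexKernel(
            num_tasks=num_fidelities, rank=1)

    def forward(self, x, fid):
        mean = self.mean_module(x)
        covar = self.input_kernel(x).mul(self.fidelity_kernel(fid))
        return gpytorch.distributions.MultivariateNormal(mean, covar)
```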

Finally, more broadly, the sensitivity of the method to hyperparameters is not discussed or explored. My guess is that the performance is extremely sensitive to parameters like the UCB $\beta$ or the locations of the inducing points, and this may prevent the method from being applied successfully to other problems. It would be good to see this discussed more.

Questions

  • Can you describe the method's sensitivity to hyperparameters?
  • Can you please explain why one would want molecules which score well at high fidelities and low fidelities, instead of only caring about high fidelities?
  • Can you comment on the statistical significance of your results?
Comment

We thank the reviewer for their useful feedback.

I thought the experiments had some significant correctness issues. First and foremost, the results in Table 1 are described as "significant" despite a very small sample size of 15 and considerable variance in the scores of the generated molecules. I think the authors need to apply a statistical test to their results so that the differences between methods can be evaluated more objectively.

We performed evaluations on 15 compounds due to the high computational cost of ABFE, as evaluating the compounds from a single method takes 9.33 hours * 15 = 5.8 days. Nonetheless, we have now evaluated an additional 15 compounds (30 in total) from MF-LAL and only the single most competitive baseline for each target. These new results have been added to Table 1 in the updated draft. We found that MF-LAL still generates compounds with better ABFE scores than baselines for both targets. To evaluate the statistical significance of our results, we measured the p-value between the mean ABFE scores of generated compounds from MF-LAL and the most competitive baseline. For BRD4(2), we computed p=0.16, and for c-MET, p=0.001. At least for c-MET, MF-LAL generates compounds with a significantly better mean ABFE score than the most competitive baseline. For BRD4(2), we approach but do not reach significance, likely because the spread of ABFE results appears to be larger for that target.
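
The response does not state which test was used; as one concrete possibility, a two-sample Welch t-test on per-compound ABFE scores would look like this (illustrative numbers only, not the paper's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder ABFE scores (kcal/mol, lower = stronger binding), 30 compounds each.
mflal_scores = rng.normal(-9.0, 2.0, size=30)
baseline_scores = rng.normal(-7.5, 2.0, size=30)

# Welch's t-test does not assume equal variances between the two methods.
t, p = stats.ttest_ind(mflal_scores, baseline_scores, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3g}")
```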

A key motivation for the method is to "ensures compounds generated at higher fidelities also scored well at lower fidelities, improving the quality of generated samples." (lines 075-077). This hypothesis seems questionable: why would one want this? The fundamental challenge in multi-fidelity AL is that the observations between different fidelities are not very well-correlated. Is it not possible that most interesting molecules at high-fidelity score poorly at lower fidelities?

Our results show that even the lowest fidelity oracle is somewhat accurate at distinguishing between active and inactive compounds (see the results in Appendix B.1 in the updated draft). Therefore, compounds that score well at the lowest fidelity are more likely to show binding at the highest fidelity than randomly chosen molecules. By searching at higher fidelities only in the space of compounds that scored well at the lower fidelities, we focus our search and greatly reduce the computational cost needed to find good compounds. While it is possible that some interesting molecules at the higher fidelities score poorly at the lower fidelities, that is not necessarily a concern for our method. Since there are many potential scaffolds that can score well at the highest fidelity, it is a worthwhile tradeoff to discard some of these scaffolds early on with lower fidelity oracles in order to decrease the computational cost at the highest fidelity and make the search tractable. We have added discussion of this point to the Limitations section of the updated draft.

The method uses SVGPs in quite a few places, which raises concerns about robustness. SVGPs are very sensitive to the location of their inducing points, and tend to overestimate variance away from their inducing points

We thank the reviewer for raising this potential concern. We have added discussion about the variance of SVGPs to the Limitations section of the updated draft. We note that while the high variance of SVGPs away from the inducing points might bias the UCB acquisition function towards molecules far away from the training distribution, we also include an L2 regularizer in the acquisition function that encourages generated compounds to stay in-distribution in the latent space. Empirically, we see that the acquisition function generally chooses molecules that fall within the training distribution, suggesting that the high variance of SVGPs is not a major concern for MF-LAL.

The distribution of the initial dataset will likely determine the location of the inducing points, and with the UCB objective it seems plausible that nothing away from the initial distribution will therefore be chosen. At the very least, the behavior will depend enormously on the locations of the inducing points.

Instead of using the initial dataset to choose the inducing points, we randomly choose 1,000 inducing points by sampling from the Gaussian prior of the latent space. This means the inducing points are independent of the initial data.
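
A minimal GPyTorch sketch of such a surrogate, with inducing points drawn from the $\mathcal{N}(0, I)$ latent prior as described (the kernel choice is an assumption, and whether the locations are subsequently optimized is not stated; they are fixed here):

```python
import torch
import gpytorch

class LatentSVGP(gpytorch.models.ApproximateGP):
    def __init__(self, latent_dim, num_inducing=1000):
        # Inducing points sampled from the latent prior, independent of the data.
        inducing = torch.randn(num_inducing, latent_dim)
        var_dist = gpytorch.variational.CholeskyVariationalDistribution(num_inducing)
        strategy = gpytorch.variational.VariationalStrategy(
            self, inducing, var_dist, learn_inducing_locations=False)
        super().__init__(strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, z):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(z), self.covar_module(z))
```

The reviewer's concern is that variance estimates far from these randomly placed points can be miscalibrated, which feeds directly into the UCB term.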

(response continues below)

Comment

I think the authors are missing a "standard MF-AL" method as a baseline - the most "standard" baselines use VAE-GP models which may not fit the data well. I recommend either an autoregressive GP (i.e., the prior mean for fidelity $k$ is the posterior mean for fidelity $k-1$) or a "co-kriging" model where a single GP has a kernel between inputs and fidelities. EI or UCB could be used as an acquisition function, and the kernel could be either a Matérn 5/2 or a Tanimoto kernel between molecule fingerprints.

We would like to emphasize our “VAE + MF-GP” baseline, which is very similar to what the reviewer proposes. We used a single co-kriging GP model with a linear truncated fidelity kernel. This is, as the reviewer requested, a single GP with a kernel between inputs and fidelities. We used the multi-fidelity max value entropy acquisition function, which is suitable for the multi-fidelity case. We would be happy to discuss further if the reviewer has suggestions for how to improve this baseline or more details on what other baseline should be considered.

Finally, more broadly, the sensitivity of the method to hyperparameters is not discussed or explored. My guess is that the performance is extremely sensitive to parameters like the UCB $\beta$ or the locations of the inducing points, and this may prevent the method from being applied successfully to other problems. It would be good to see this discussed more.

As stated in Appendix B, we conducted a hyperparameter search for MF-LAL with 20 random trials using the first 3 fidelities (to make the computational cost manageable). This relatively small amount of hyperparameter searching already gives us strong results. We have updated the draft to further discuss this concern in Appendix B, and point out some specific hyperparameters that seem to be especially important for performance. To answer the reviewer's specific question about $\beta$, revisiting the hyperparameter search results suggests that varying $\beta$ by $\pm 0.2$ does not affect performance much.

Comment

Thank you for the changes. My thoughts per topic:

  • Statistical significance: thank you for including this. My interpretation is that your method plausibly scores higher than other methods, but it is not clear (e.g., the relatively high p-value for BRD4(2)). I know the high cost of experiments makes a larger test prohibitively expensive, but unfortunately this is not a reason to ignore significance test results.
  • Low/high fidelity: makes sense, thanks. I think I misunderstood.
  • SVGP: thank you for adding this discussion, although I would not say that this is "not a concern" - it means your method is systematically deviating from the ideal UCB policy by use of an approximate model. Choosing inducing points via random sampling is arguably one of the worst choices possible, since the ideal inducing point locations tend to be very close to the training data points. Overall this is not an error on your part, but it does add a lot of unpredictability to the method in my opinion.
  • GP baselines: I would not consider VAE + MF-GP to be a suitable baseline because it uses VAE representations, where distance in VAE latent space may not correspond to a clear structural change (or at the very least, it is highly dependent on the VAE used). A baseline using fingerprint features avoids any VAE-related pathologies, which is why I suggested it as a baseline.
  • Hyperparameters: thank you for adding this discussion

Overall, while I appreciate the response, I don't think it has changed my overall assessment of the paper: the method still seems brittle and plausibly not better than existing alternatives. If I could change my score from 3->4 I would, but ICLR only allows a change to 5 or 6, which I don't feel are appropriate given these concerns.

Comment

Thank you for your response!

Overall this is not an error on your part, but it does add a lot of unpredictability to the method in my opinion.

Thank you for clarifying the concerns with SVGPs. We have added some brief discussion of this point to the Limitations section in the updated draft.

A baseline using fingerprint features avoids any VAE-related pathologies, which is why I suggested it as a baseline.

For any baseline, the GP model would need to take points in the VAE latent space as input. This is because the way we generate molecules is by backpropagating the output of the GP into the latent space to find locations with high surrogate-predicted activities. A consequence of this is that the surrogate's mapping from latent space to activity prediction needs to be fully differentiable, so we cannot use representations like fingerprints as input, because generating them involves non-differentiable operations.

Comment

For any baseline, the GP model would need to take points in the VAE latent space as input. This is because the way we generate molecules is by backpropagating the output of the GP into the latent space to find locations with high surrogate-predicted activities. A consequence of this is that the surrogate's mapping from latent space to activity prediction needs to be fully differentiable, so we cannot use representations like fingerprints as input, because generating them involves non-differentiable operations.

A "baseline" method does not need to be integrated into your method. All you need to do for a baseline BO method is maximize an acquisition function, which could be done by sampling from a VAE or other generative model, or alternatively with an evolutionary algorithm. I encourage you to think more about a baseline which does not require backpropagation, since this is not an intrinsic requirement of multi-fidelity optimization.

Review
Rating: 8

The authors introduce an active learning generative framework to design novel small molecules (ligands) that are optimized to have high binding affinity to a protein target, using a set of oracle functions. Their method uses an increasing hierarchy of oracles whose output becomes more reliable at higher fidelities, with the downside that executing the higher-fidelity oracles becomes more computationally expensive.
The authors propose to use an increasing set of encoder-decoder architectures (one per fidelity space) to embed a ligand from its SELFIES representation into a latent space and to decode the latent representation back for generation purposes. Crucially, the latent representations from the encoder(s) are suitable for training surrogate models for the multiple fidelities, with the advantage that gradient-based optimization can also be performed in the latent spaces.
The authors show that their proposed method, which performs active learning on hierarchical latent spaces, outperforms current string-based and 3D-based generative models in their experiments on two protein targets that are involved in human cancer development.

Strengths

The paper is well written and motivated by the current methods in the literature, which mostly rely on a single oracle function such as docking, without considering higher-fidelity, more reliable evaluation metrics. Since the use of higher-fidelity oracles such as computing the absolute binding free energy (ABFE) is costly, approximating those computations with neural networks as surrogate models is reasonable and brings the advantage that these surrogate models can be used for gradient-based optimization. The active learning component, usually applied in supervised learning for querying samples from a given dataset with a costly oracle function, is here used for generation purposes, where the optimization for promising ligands is done in a hierarchy of latent spaces using gradient-based methods. The authors show that the proposed loss terms for active learning, i.e., the acquisition function and the likelihood terms from lower fidelity levels, are necessary for improved evaluation results, which is confirmed in their ablation studies.

Weaknesses

The comparison to the 3D generative model Pocket2Mol is not fair, since no optimization is done for that method to sample ligands with high binding affinity. The (autoregressive) Pocket2Mol model also no longer represents the state of the art for 3D generative models, although it is faster at generation than current diffusion-based models. I would be interested in how the 1D/SELFIES-based MF-LAL framework compares against newer 3D generative models such as DecompDiff [1] or PILOT [2] without applying guidance or calling the oracles, since those methods might not be directly applicable within the multi-fidelity hierarchy.

[1] DecompDiff: Diffusion Models with Decomposed Priors for Structure-Based Drug Design, https://proceedings.mlr.press/v202/guan23a.html.
[2] PILOT: equivariant diffusion for pocket-conditioned de novo ligand generation with multi-objective guidance via importance sampling, https://pubs.rsc.org/en/content/articlelanding/2024/sc/d4sc03523b

Questions

How is the reconstruction accuracy for the 4 separate encoder-decoder networks? The authors write in lines 183-185 that "the use of a specialized decoder for each fidelity level improves reconstruction quality compared to previous methods, that only use one, thus making the generated samples more tailored for their fidelity level".

In the supplementary code base, in main.py, I saw that the weighting coefficient for the KL prior is set quite low at ~0.08, indicating that the latent space is not really smooth; especially for the first fidelity level $z_1 \sim \mathcal{N}(z_1 \mid \mu_1(x), \sigma_1(x))$, where 200,000 samples are available for training VAE-1, "holes" in the latent space might appear. Could the authors provide reconstruction learning curves over epochs? The higher the fidelity level, the sparser the latent spaces $Z_k$, $k \geq 2$, become, since these spaces are trained with less data. Although the authors provide the log-likelihood term of a mixture of Gaussians from previous latents, it is not clear how often an optimized latent $z_k$ can be decoded back to a valid SELFIES string.
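
For reference, a KL weighting of this kind corresponds to the usual β-VAE objective (assuming the ~0.08 coefficient multiplies the KL term; this is an inference from main.py, not a statement in the paper):

$$\mathcal{L}(\theta,\phi) = E_{q_\phi(z_1|x)}\left[\log p_\theta(x|z_1)\right] - \beta\,\mathrm{KL}\!\left(q_\phi(z_1|x)\,\|\,\mathcal{N}(0, I)\right), \qquad \beta \approx 0.08$$

A small $\beta$ trades prior regularity for reconstruction quality, which is exactly what can leave "holes" between encoded training points.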

Comment

We thank the reviewer for their useful feedback.

I would be interested in how the 1D/SELFIES-based MF-LAL framework compares against newer 3D generative models such as DecompDiff [1] or PILOT [2] without applying guidance or calling the oracles, since those methods might not be directly applicable within the multi-fidelity hierarchy.

Thank you for the suggestion! We have now run DecompDiff as a baseline and report the results in Table 1. Our results show that DecompDiff generally produces molecules with slightly worse ABFE scores than Pocket2Mol, and significantly worse scores than MF-LAL.

How is the reconstruction accuracy for the 4 separate encoder-decoder networks? … Could the authors provide reconstruction learning curves over epochs? As the fidelity level gets higher, the sparser the latent spaces become, since these spaces are trained with less data.

We found that the lower fidelity latent spaces generally have lower reconstruction accuracy than the higher fidelity latent spaces, likely because the higher fidelity latent spaces have a more limited compound space to decode (even if the training data is more sparse). However, the reconstruction accuracy is generally high across all fidelities, showing that MF-LAL successfully learned the mapping from latent space to molecule. We have added these new results measuring the reconstruction rate of each decoder during the active learning process to Appendix C.2.
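
A sketch of how such a reconstruction rate can be computed (hypothetical encode/decode callables, not the authors' API):

```python
def reconstruction_rate(encode_k, decode_k, compounds):
    # Fraction of a fidelity level's training compounds that survive an
    # encode -> decode round trip unchanged (e.g., compared as SELFIES strings).
    hits = sum(decode_k(encode_k(s)) == s for s in compounds)
    return hits / len(compounds)

# Usage sketch: rate = reconstruction_rate(enc_k, dec_k, train_set_k)
```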

Comment

Thank you for running the additional experiments and comparing MF-LAL against newer 3D generative models such as DecompDiff.

Regarding the new experiment on the reconstruction accuracy in Appendix C.2, just for clarification: the caption says that "the y-axis shows the proportion of training set compounds that were successfully reconstructed using the decoder". The "training set" refers to the training set each fidelity is trained on, correct?

I acknowledge the authors' efforts in performing these experiments and keep my score.

Comment

Thank you for your response and positive comments!

The "training set" is referred to the training set each fidelity is trained on, correct?

Yes, that's right, the reconstruction accuracy is measured on the fidelity-specific training set.

Comment

Thank you for the clarification!

Review
Rating: 5

The authors propose a new approach to designing molecules that bind to a given target protein. Conventional approaches are mostly based on docking methods, which are known to be insufficient for reliable predictions of binding between molecules and a target protein. Absolute binding free energies (ABFE) are the most accurate prediction method for binding affinities, but they are too expensive to apply to many molecules. The authors address this problem with a novel multi-fidelity latent space active learning method (MF-LAL). The results show that the proposed method outperformed baseline models. The idea of MF-LAL is new, but the benchmark results need further justification because only 15 generated molecules were used in the evaluation.

Strengths

  1. Latent space active learning combined with generative AI to efficiently query the oracles.
  2. Multi-fidelity latent spaces, with a decoder at each level, leading to higher accuracy.
  3. A good compromise between speed and accuracy, achieved by reusing latent information from lower levels at higher levels.
  4. The use of ABFE for prediction is challenging in conventional approaches, but this work has used it for evaluation thanks to the relatively high efficiency of the proposed method.

Weaknesses

  1. Only 15 generated compounds have been tested in the evaluations, which seems too few to support the conclusions.
  2. The mean and top-3 values of the 15 generated molecules differ considerably.
  3. Training at the highest fidelity level is terminated not by a training error threshold but by the total computational time. Thus, no systematic convergence for a given target can be established.

Questions

  1. Maximizing the likelihood that the molecule would also be generated at fidelity $k-1$ with a high property score apparently limits the exploration space, and the oracle at the lowest level is likely to misguide the search due to its poor accuracy. This might be the reason for the high variance (especially the large gap between the mean and top-3 values) in the predicted properties of the 15 generated molecules. Moreover, as shown in Table 2, the top-3 results without the likelihood term are not significantly lower than with it. The authors need to elaborate on the potential effect of the likelihood constraint (Eq. 3) on constraining the exploration.
  2. Using only 15 generated compounds is too few to evaluate statistically meaningful performance, which casts doubt on the conclusions. In particular, the top-3 values strongly depend on sampling. The authors may evaluate with more samples or discuss the limitations.
  3. Training the proposed model is terminated by a total time limit, which results in an unsystematic convergence criterion. As a result, the binding affinities of the top-2 and top-3 compounds for c-MET are much weaker than the top-1, whereas the top-3 values for BRD4 are in a similar range. The authors need to justify the consequences of their termination criterion (7 days) for model performance.
  4. The authors define a set of probabilistic decoder networks that translate a latent vector at each fidelity level back to the original molecule. This reviewer wonders about the performance of each decoder in terms of success rate in reconstructing the original molecule. Do they all recover the same correct molecule, or do they show different success rates depending on the fidelity level?
Comment

We thank the reviewer for their useful feedback.

Maximizing the likelihood that the molecule would also be generated at fidelity $k-1$ with a high property score apparently limits the exploration space, and the oracle at the lowest level is likely to misguide the search due to its poor accuracy.

Limiting the exploration space is the goal of the likelihood term as it reduces the number of oracle calls that must be expended at the highest fidelity, and should not be seen as a weakness of our method. While it is possible that the lower fidelity oracles somewhat misguide the search for good compounds at the higher fidelities, that is also not a major concern for our method. First, even the lowest fidelity oracle is still somewhat accurate (see the results in Appendix B.1 in the updated draft), so molecules that scored well at this level are more likely to show binding at the highest fidelity than randomly chosen molecules. Therefore, it makes sense to restrict the search at the higher fidelities to these regions that are richer in active compounds, even if not all regions end up being fruitful. Second, in order to make the search for good molecules at the highest fidelity computationally tractable, we must use the lower fidelity oracles to discard some regions of chemical space and focus our search. Since there are many potential scaffolds that can score well at the highest fidelity, our method is still able to find molecules that have good scores, even if some good scaffolds were discarded by the low fidelity methods. We have added discussion of this point to the Limitations section of the updated draft.

…as shown in Table 2, the top 3 results w/o likelihood term are not significantly lower than with it. The authors need to elaborate on the potential effect of using the likelihood constraint (eq 3) on constraining the exploration.

The likelihood term significantly improves the mean ABFE score of the generated compounds (Table 2). As the generated compounds are also highly structurally diverse (Section 4.3), this means that the likelihood term allows for the generation of a large number of diverse drug candidates with good binding (low mean ABFE scores). This is relevant for medicinal chemists, who desire a large number of structurally diverse and strongly binding drug candidates to pick from for further optimization. In contrast, MF-LAL without the likelihood term generates compounds with a much worse mean ABFE score, meaning there are fewer strongly binding compounds for medicinal chemists to choose from (even if the top compounds have similar scores). Thus, using the likelihood term gives more desirable results than without the likelihood term.

Using only 15 generated compounds is too few to evaluate statistically meaningful performance, which casts doubt on the conclusions. In particular, the top 3 values strongly rely on sampling. The authors may evaluate with more samples or discuss the limitations.

We agree that only evaluating 15 generated compounds may seem limiting. However, this is mainly due to the high computational cost of ABFE, as evaluating the compounds from a single method takes 9.33 hours * 15 = 5.8 days. Nonetheless, we have now evaluated an additional 15 compounds (30 in total) from MF-LAL and only the single most competitive baseline for each target. These new results have been added to Table 1 in the updated draft. We found that MF-LAL still generates compounds with better ABFE scores than baselines for both targets. To evaluate the statistical significance of our results, we measured the p-value between the mean ABFE scores of generated compounds from MF-LAL and the most competitive baseline. For BRD4(2), we computed p=0.16, and for c-MET, p=0.001. At least for c-MET, MF-LAL generates compounds with a significantly better mean ABFE score than the most competitive baseline. For BRD4(2), we approach but do not reach significance, likely because the spread of ABFE results appears to be larger for that target.

Training the proposed model is terminated with a total time limit, which causes an unsystematic convergence criterion… The authors need to justify the consequences of their termination criterion (7 days) on the model performance.

Terminating training after a fixed run time most accurately reflects the real-world use case of our method in a resource-constrained drug discovery context. Waiting until training error converges for baselines such as the single-fidelity ABFE method would involve many expensive oracle calls and thus take a long time, so it is instead more valuable to measure the accuracy after expending some fixed computational budget. Additionally, accuracy and run time are correlated, so terminating training after a fixed run time does not yield unsystematic results.

(response continued below)

Comment

The authors define a set of probabilistic decoder networks that translate a latent vector at each fidelity level to a single original molecule. This reviewer wonders about the performance of each decoder in terms of success rates in reconstructing the original molecule. Do they all give the same correct molecule or show different success rates depending on the fidelity level?

Thank you for the suggestion! We have added results measuring the reconstruction rate of each decoder in our architecture during the active learning process. Appendix C.2 in the updated draft shows our results. Briefly, we found that the lower fidelity latent spaces have generally lower reconstruction accuracy than the higher fidelity latent spaces, which is likely because the higher fidelity latent spaces have a more limited compound space to decode. However, the reconstruction accuracy is generally high across all fidelities, showing that MF-LAL successfully learned the mapping from latent space to molecule.

Comment

Thanks for the clarification!

Comment

This reviewer appreciates the authors' efforts during the revision, especially their performance of additional expensive experiments in such a limited time. However, some of my previous comments still need to be clarified.

First, regarding the issue that "the top 3 results w/o likelihood term are not significantly lower than with it. The authors need to elaborate on the potential effect of using the likelihood constraint (eq 3) on constraining the exploration.", the authors claimed that the mean value with the likelihood term is better than without it. That is true, but the top-3 scores without the likelihood term are close to those with it, and in practice the top-N molecules are more important than the mean value. The significantly low mean value compared to the top-3 values may reflect the proposed method's inefficiency in identifying high-potency molecules. This result would be related to the second concern below.

Second, though the authors claimed that training with an affordable fixed computational budget does not give unsystematic results, this reviewer disagrees with this opinion. Each target protein may require a different optimal running time for full convergence, so running with a fixed time will give results with high variance depending on the degree of convergence for each target. This may be related to my first concern. A limited running time can give less-converged results, making it challenging to statistically evaluate the proposed method's performance. This reviewer also understands that the notoriously high computational cost of ABFE does not allow sufficient running time, but this is a practical limitation of the proposed method that should be acknowledged.

Comment

Thank you for your response!

That is true, but the top 3 scores w/o the likelihood are close to those with it; top N molecules are more important than the mean value in practice.

We agree, but our choice of top 3 is somewhat arbitrary. Considering the top 5 or 10 molecules, for instance, would show a clearer advantage for MF-LAL with the likelihood term.

This reviewer also understands that the notoriously high computation cost of ABFE does not allow sufficient running time, but this is a practical limitation of the proposed method that should be admitted.

We still think using a fixed computational budget gives systematic results, because the performance of each method after a fixed run time should be (inversely) proportional to the final convergence time. Nonetheless, we do agree that our experiments may have high variance because of the limited computation time, so we have added some discussion of this issue to the Limitations section in the updated manuscript.

Comment

This reviewer also agrees that a fixed run time can give a systematic result, but this is only valid for systems of the same size. What if one has much larger proteins? Then a fixed run time will give different accuracy depending on the size of the system.

Review
Rating: 5

The paper introduces a novel framework called Multi-Fidelity Latent Space Active Learning (MF-LAL) for de novo drug design. The primary challenge addressed is the generation of drug compounds with real-world biological activity, a task complicated by the limitations of current evaluation methods. Generative models often rely on molecular docking as an oracle to guide compound generation, but docking scores do not consistently correlate with experimental activity. The authors state that more accurate methods, such as molecular dynamics-based binding free energy calculations, exist but are computationally prohibitive for large-scale generative tasks. MF-LAL proposes integrating multiple oracles of varying fidelity levels, each with different accuracy and computational cost, into a single generative modeling framework. Unlike previous approaches that train the surrogate model and the generative model separately, MF-LAL combines them, allowing for more accurate activity predictions and higher-quality sample generation. The framework employs a novel active learning algorithm to efficiently select which oracle to query at each step, further reducing computational overhead. Experimental results on two disease-relevant proteins demonstrate that MF-LAL outperforms both single- and multi-fidelity baseline approaches in producing compounds with better binding free energy scores.

Strengths

This work addresses one of the most important problems in de novo drug design: the translational efficiency that oracles transmit to the generation process.

The hierarchical latent space, each corresponding to a fidelity level, is a clever design choice. This allows the model to specialize the latent representation for each fidelity, improving both surrogate modeling and the quality of generated compounds.

The use of SELFIES is appropriate for this setup, as this work is not focused on broader chemical exploration.

The introduction of a likelihood-based term in the generation objective is a smart technical trick. Specifically, when generating a molecule at fidelity level $k$, the model maximizes the likelihood that this molecule would also be generated at fidelity level $k-1$ with a high property score.

Surrogate models $\hat{f}_k$ are directly trained on their corresponding latent representations $z_k$, leveraging SVGPs. This integration ensures that the latent spaces are organized to optimize property prediction at each fidelity, enhancing the overall performance of both the generative and surrogate models. The use of SVGPs for surrogate modeling at each fidelity level introduces mathematical robustness, and the joint minimization of the ELBO and the marginal log-likelihood ensures that the latent representations are well-structured for property prediction.
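For concreteness, an SVGP surrogate over a latent space can be set up along these lines with GPyTorch; the inducing-point count, latent dimensionality, and kernel below are assumptions, and the paper's actual joint objective also includes the VAE reconstruction terms.

```python
import torch
import gpytorch

class LatentSVGP(gpytorch.models.ApproximateGP):
    """Sparse variational GP surrogate f_k over latent points z_k."""

    def __init__(self, inducing_points):
        var_dist = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0))
        var_strat = gpytorch.variational.VariationalStrategy(
            self, inducing_points, var_dist, learn_inducing_locations=True)
        super().__init__(var_strat)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel())

    def forward(self, z):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(z), self.covar_module(z))

# e.g. 64 inducing points in a 32-dimensional latent space (assumed sizes)
model = LatentSVGP(torch.randn(64, 32))
likelihood = gpytorch.likelihoods.GaussianLikelihood()
# This variational ELBO would be summed with the VAE loss so the latent
# space is shaped jointly for reconstruction and property prediction.
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=1000)
```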

Weaknesses

The whole technical approach is fantastic. However, this work has some issues, both conceptual and in the evaluation of its results.

While it is true that the resolution of binding free energy estimation can be higher for some systems, this is taken as a pillar fact in this work even though it might not be true. It is of utmost importance to check any physics-based protocol with compounds of known affinity and actually corroborate that the low- to high-fidelity methods stand as such.

To that end, the authors select 15 compounds for each method and compute ABFE metrics on all of them. However, they base a great deal of their analyses on the top 3 scoring compounds per system. When comparing the mean of the 15 generated compounds with the values for the top 3 compounds, it is clear that the latter are outliers, which might not be bad per se (it is exactly what a generative campaign is looking for) but must always be followed by a careful analysis of the motifs of the generated molecules. Scoring functions of any type, docking or ABFE, are biased towards certain types of chemical structures and assign high scores to highly reactive topologies. For instance, the best molecule for BRD4 has an SO2 group, known to be pan-reactive, and the best molecule for c-MET has an aldehyde group, also known to bind many targets potently. The other two c-MET compounds have nitro groups, whose charges are known to be rewarded by scoring functions. While those are valid molecules, they would potentially be discarded as artifacts.
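As context for how such compounds would be flagged in practice, generated molecules are typically screened against structural alert sets; below is a short RDKit sketch using the Brenk filters, which include aldehyde and nitro-group alerts. The SMILES in the usage comment are placeholders, not the paper's compounds.

```python
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# Build a catalog of Brenk structural alerts (unstable/reactive motifs).
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.BRENK)
catalog = FilterCatalog(params)

def flag_artifacts(smiles_list):
    """Return (SMILES, alert description) for molecules triggering an alert."""
    flagged = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None and catalog.HasMatch(mol):
            flagged.append((smi, catalog.GetFirstMatch(mol).GetDescription()))
    return flagged

# e.g. flag_artifacts(["O=[N+]([O-])c1ccccc1", "CCO"]) flags nitrobenzene
```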

Also, the approach lacks benchmarking of the systems on docking and ABFE with compounds known to bind to them. That would give a more solid comparison point for generated molecules.

Questions

I encourage the authors to address the following:

  1. Perform docking and ABFE on known binder compounds.
  2. Revisit the docking results, taking into account the aforementioned issues.
Comments

We thank the reviewer for their useful feedback.

While it is true that the resolution of binding free energy estimation can be higher for some systems, this is taken as a pillar fact in this work even though it might not be true. It is of utmost importance to check any physics-based protocol with compounds of known affinity and actually corroborate that the low- to high-fidelity methods stand as such.

the approach lacks benchmarking of the systems on docking and ABFE with compounds known to bind to them. That would give a more solid comparison point for generated molecules.

We would like to emphasize Appendix B.1, which contains results about the accuracy of each oracle on known binders. In brief, we measure the binary classification ability (ROC-AUC) of each oracle to distinguish between known actives and presumed inactives for BRD4(2). The results confirm that the presumed higher fidelity oracles indeed have higher accuracy. In light of the reviewer’s comment, we have now conducted a similar experiment for the other protein target, c-MET, and have included the new results in Appendix B.1. These new results also show that the higher fidelity oracles are more accurate.
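As an illustration of that protocol (not the authors' exact script), the per-oracle check could be computed as follows, assuming lower (more negative) oracle scores mean stronger predicted binding:

```python
from sklearn.metrics import roc_auc_score

def oracle_auc(scores, labels):
    """ROC-AUC of an oracle at separating actives (1) from inactives (0).
    Scores are negated because more negative = stronger predicted binding."""
    return roc_auc_score(labels, [-s for s in scores])

# Compare every fidelity level on the same benchmark set, e.g.:
# aucs = {name: oracle_auc(oracle_scores[name], labels) for name in oracle_scores}
```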

…the authors select 15 compounds for each method and compute ABFE metrics on all of them. However, they base a great deal of their analyses on the top 3 scoring compounds per system. When comparing the mean of the 15 generated compounds with the values for the top 3 compounds, it is clear that the latter are outliers, which might not be bad per se (it is exactly what a generative campaign is looking for) but must always be followed by a careful analysis of the motifs of the generated molecules.

Our evaluation of the generated compounds is based on the real-world use case of our method, which would involve taking the top generated compounds and using them as starting points for further optimization. In this case, the most relevant factor is how strong the binding is for the top generated compounds, so we do not think focusing on them is a weakness of our work. Indeed, evaluating only the top generated compounds is common practice in the literature on generative modeling for molecules [1, 2, 3].

[1] Lee et al. “Exploring Chemical Space with Score-based Out-of-distribution Generation.” ICML 2023.

[2] Luo et al. “GraphDF: A Discrete Flow Model for Molecular Graph Generation.” ICML 2021.

[3] Fu et al. “Reinforced Genetic Algorithm for Structure-based Drug Design.” NeurIPS 2022.

Scoring functions of any type, docking or ABFE, are biased towards certain types of chemical structures and assign high scores to highly reactive topologies.... While those are valid molecules, they would potentially be discarded as artifacts.

The highly reactive motifs mentioned by the reviewer are known to be overly rewarded by docking scoring functions. However, it is not clear that the same is true for the more accurate ABFE calculations. This exemplifies the benefit of the multi-fidelity approach, where inaccuracies in lower-fidelity oracles can be corrected by higher-fidelity, more accurate methods. Additionally, while the presence of the mentioned motifs might concern medicinal chemists, the generated compounds that contain them still show strong binding and would therefore be valuable as starting points for further optimization (e.g., improving the toxicity profile).

Comments

We have uploaded an updated version of the manuscript in response to the reviewers’ feedback, with changes highlighted in red. Briefly, we have:

  • Run ABFE on 15 additional compounds generated by MF-LAL and the most competitive baseline for both targets to better evaluate the difference between the mean ABFE scores of generated compounds. Comparing MF-LAL with the most competitive baseline, statistical tests give a p-value of $p=0.16$ for BRD4(2) and $p=0.001$ for c-MET (a sketch of such a two-sample test is given after this list). These experiments alleviate concerns from Reviewers cHMx and bhTM about the statistical robustness of our results. The new results are included in Table 1.

  • Run an additional baseline, DecompDiff, that represents the state-of-the-art in 3D pocket-aware generative models (requested by Reviewer wbaj). We found that MF-LAL generated compounds with significantly better ABFE scores than DecompDiff across both targets. Results are in Table 1.

  • Confirmed that higher fidelity oracles are more accurate for the c-MET target. We ran each of our oracles on known binders and non-binders of c-MET, and confirmed that the higher fidelity oracles are better at distinguishing between binders and non-binders. These results address the concerns of Reviewer SEcZ about the hierarchy of oracles. Results are in Appendix B.1.

  • Plotted the reconstruction accuracy of the decoders at each fidelity during active learning. We found that lower fidelity latent spaces generally have worse reconstruction accuracy, although it is still relatively high across all fidelities. These results, which were requested by Reviewers cHMx and wbaj, confirm that MF-LAL successfully learns a mapping from latent points to molecules. Results are in Appendix C.2.
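Regarding the statistical comparison in the first item, a two-sample test of the kind reported could be run as follows. The choice of Welch's t-test and all numbers below are placeholders; the rebuttal does not specify which test was used.

```python
from scipy import stats

# Placeholder ABFE scores (kcal/mol); not the paper's actual values.
mf_lal_scores   = [-9.1, -8.7, -8.3, -7.9, -7.6]
baseline_scores = [-7.8, -7.5, -7.2, -6.9, -6.5]

# Welch's t-test (does not assume equal variances); a nonparametric
# Mann-Whitney U test would be a common alternative for small samples.
t_stat, p_value = stats.ttest_ind(mf_lal_scores, baseline_scores,
                                  equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```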

AC Meta-Review

Multi-Fidelity Latent Space Active Learning (MF-LAL), which adaptively calls oracles of different fidelities, was proposed for efficient de novo drug design. The main contribution is a latent active learning strategy that efficiently queries oracles, balancing model accuracy against computational budget for molecular design. Experimental results were reported comparing MF-LAL with both single- and multi-fidelity baselines in producing compounds with better binding free energy scores.

The reviewers expressed concerns about the mathematical rigor and practical significance of MF-LAL. More comprehensive and principled evaluation experiments, including more baselines, were also requested to better establish the significance of the proposed framework.

The authors may want to consider carefully addressing these concerns to improve the quality of their future submission.

Additional Comments from Reviewer Discussion

During rebuttal, the authors have corrected some of the errors and provided clarifications. However, the reviewers still have issues with the limited baseline comparison, biased performance comparison, and mathematical rigor. One reviewer also commented about the significance of the presented work: "the paper proposes a really complex method and does not convincingly show that it overcomes limitations of previous methods for the same task."

Final Decision

Reject