RetroBridge: Modeling Retrosynthesis with Markov Bridges
Abstract
Reviews and Discussion
This paper addresses single-step retrosynthesis, formalized as a mapping problem from product molecule graphs to reactant molecule graphs. The authors propose the Markov Bridge Model, which learns the dependency between two discrete variables, and apply it to this graph mapping problem. The proposed method outperforms state-of-the-art template-free methods on the USPTO-50K benchmark.
Strengths
- The Markov Bridge Model is a novel and interesting approach for discrete variables and can be applied to other fields.
- The evaluation results using top-k round-trip coverage and accuracy support the potential of the proposed method.
- The future work described in Section 5 clearly lays out the limitations of this work.
Weaknesses
- The reason for selecting a graph-based representation needs to be explained. It is not clear to me why molecules are represented as graphs instead of the 3D coordinates of atoms, since 3D coordinates are frequently used in the unconditional generation of molecules (Hoogeboom et al., 2022).
Questions
No questions.
Details of Ethics Concerns
No concerns.
Thank you for the very positive review. We are glad you liked our discussion section and see potential use cases of Markov bridges in other fields.
To answer your concern about the graph representation: in most cases, there are many different arrangements of 3D coordinates for a given bond connectivity of a molecule (which is described very well by a graph); in fact, these are distributed according to their energy. However, how should we pick the most relevant 3D coordinates for the starting molecule(s) as an input to our model? Specific conformers usually depend on external factors (e.g. whether the ligand is in a bound or unbound state), and while some reactions are highly context-dependent (e.g. enzymatic reactions), this context is usually unknown a priori. At the same time, predicting and testing multiple conformations would be prohibitively expensive. Using the molecular graph instead of 3D information abstracts away this problem but still captures the most important features and behaviour of the molecules from a chemistry perspective. For this reason, almost no other work in ML for retrosynthesis uses 3D coordinates as input.
Nevertheless, we agree that 3D representations of molecules can be extremely useful for some tasks in computational chemistry, especially if a unique (low energy) structure is available.
I appreciate the response from the authors. The answers to my questions were satisfying.
I have also read the opinions of the other reviewers, and I believe this paper should be accepted.
The paper presents a retrosynthesis model where the product is transported to the reactants, inducing a Markov bridge. The method is simple and achieves state-of-the-art performance.
Strengths
- The presentation of the paper is clean, and the paper is easy to read and understand.
- The method is principled and simple, and seems to work well. The idea of flowing product to reactant is very sensible.
Weaknesses
- The method is incremental: the diffusion is directly from Austin, and I think the loss function is also only a variant of diffusion losses. The network is directly from the literature. There are no chemical inductive biases that I can identify.
- The performance gains are analyzed somewhat, but not sufficiently. I like the ablations that show the context and VLB help, but these are generally known from the diffusion literature already. It would have been good to try to expose why the performance is better than MEGAN's. For instance, how much of the results come from network tuning or training tricks?
- There are many missing competing methods, such as RetroFormer and Graph2Smiles.
Questions
- The paper misses quite a lot of related works. Retroformer, Graph2Smiles, GTA, Tied transformer all seem to be missing. The paper needs to cite methods from 2022 and 2023, and include USPTO benchmark comparisons.
- If you pin the outcome to y in eq 2, why does fig1 show that we start from x, and diverge into a bunch of y’s? Shouldn’t fig1 show starting from single x, fanning out, and reducing all mask down to single y?
- What does t ~ U(0,…,T-1) mean in eq 7? Seems very odd… Why doesn’t it go to the end T? Why do we have multiple values?
- What happens to the last timestep in eq 7? This seems to be a Dirac, which probably poses problems. Currently it seems that the final prediction is never evaluated, and we only predict until z_{T-1} instead of matching the final z_T to y. How do you evaluate the final prediction z_T?
- What is p(z|x,y)?
- What is p(z_{t+1} | z_0, z_T)? Where does this go? Why do we talk about this? Does this have something to do with p(z_t | z_0, y)?
W2: The performance gains are analyzed somewhat, but not sufficiently. I like the ablations that show the context and VLB help, but these are generally known from the diffusion literature already. It would have been good to try to expose why the performance is better than MEGAN's. For instance, how much of the results come from network tuning or training tricks?
We are happy to hear that you find value in our ablation studies. We believe that this kind of analysis should be independently conducted for each specific use case. For example, previous work on graph generation with diffusion models [1] found a simple cross-entropy loss to perform well, while we observed improved performance using the more principled VLB.
While we did not derive our method from MEGAN and cannot trace what exactly improves the performance, we argue that sampling proportional to p(reactants|product) is important to find a diverse set of valid reactants to provide more good options to a multi-step pipeline (as eventually supported by our top-k and round-trip results for k>1). Our framework is designed to approximate this probability while MEGAN optimizes the likelihood of a specific sequence of edits, imposing a more or less arbitrary order, even though different sequences can lead to the same outcome. We therefore expected our formalism to be better suited to the task, which might explain the superior performance.
We selected training and network hyperparameters among a small set of options (CE and VLB loss functions, context/no-context) and selected the best model based on its performance on the validation set before the final test set evaluation. We do not expect unusually large gains from network tuning alone. Besides, it is reasonable to believe that a similar amount of hyperparameter tuning has been performed for other retrosynthesis models.
W3: There are many missing competing methods, such as RetroFormer and Graph2Smiles.
Q1: The paper misses quite a lot of related works. Retroformer, Graph2Smiles, GTA, Tied transformer all seem to be missing. The paper needs to cite methods from 2022 and 2023, and include USPTO benchmark comparisons.
We thank the reviewer for these comments. We added a comparison with all three methods to our tables. In addition, we added even more methods in a new Table 4, where we provide more information on the technical aspects of the methods and the evaluation methodology.
Q2: If you pin the outcome to y in eq 2, why does fig1 show that we start from x, and diverge into a bunch of y’s? Shouldn’t fig1 show starting from single x, fanning out, and reducing all mask down to single y?
Equation 2 provides a required property of a Markov bridge. Figure 1 represents three different Markov bridges that start at the same point x and end at three different points y_1, y_2 and y_3. Thus, for each of these bridges there is a specific requirement that it should end at a specific point. For example, for the first (orange) bridge the requirement from eq. 2 reads p(z_T = y_1 | z_0 = x, y_1) = 1.
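To visualize this pinning property, here is a minimal, self-contained Python sketch (not the authors' code): it simulates one toy bridge over K categorical states, assuming an interpolation-style transition kernel with a hypothetical schedule that reaches 1 at the final step, so the process is guaranteed to end at its designated endpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 10                  # number of categories and time steps (toy sizes)
x, y1 = 0, 3                  # shared start state and one pinned endpoint

# hypothetical schedule: beta_t grows to 1, so the last transition is deterministic
betas = np.linspace(0.1, 1.0, T)

z = x
for t in range(T):
    # mix the current state with the endpoint: p(z_{t+1} | z_t, y1)
    probs = (1 - betas[t]) * np.eye(K)[z] + betas[t] * np.eye(K)[y1]
    z = rng.choice(K, p=probs)

assert z == y1                # the bridge ends at y1 with probability 1
```

Running the same loop with endpoints y_2 or y_3 yields the other two bridges in Figure 1: all start at x, but each is pinned to its own terminal state.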
Q3: What does t ~ U(0,…,T-1) mean in eq 7? Seems very odd… Why doesn’t it go to the end T? Why do we have multiple values?
t ~ U(0,…,T-1) means that the time step t is sampled from the uniform distribution over the discrete set of values {0, 1, …, T-1}. As shown in Appendix A.1, we replace the sum over all steps with the expectation over t, as it is an unbiased estimator, which is crucial for efficient simulation-free learning. In particular, we move from the sum to the expectation in the transition from (29) to (30) (Appendix A.1). It does not go to the end, T, because for a given t we compute the KL divergence between distributions of z_{t+1}. It means that for t = T-1 we will compute the KL divergence for distributions of z_T.
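To make this estimator concrete, here is a minimal sketch of a single Monte Carlo term of the objective, reusing the toy interpolation kernel from the sketch above (the schedule, the closed-form marginal of z_t, and the uniform stand-in for the network prediction are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

rng = np.random.default_rng(0)
K, T = 5, 10
x, y = np.eye(K)[0], np.eye(K)[3]         # one-hot start and end states
betas = np.linspace(0.1, 1.0, T)          # assumed schedule with beta_{T-1} = 1

t = int(rng.integers(0, T))               # t ~ U(0, ..., T-1)
stay = np.prod(1.0 - betas[:t])           # under this toy kernel, z_t stays x until the first jump to y
z_t = x if rng.random() < stay else y     # z_t ~ p(z_t | x, y)

true_next = (1 - betas[t]) * z_t + betas[t] * y    # ground truth p(z_{t+1} | z_t, y)
pred_next = np.full(K, 1.0 / K)           # stand-in for the network's q(z_{t+1} | z_t)
loss_term = T * kl(true_next, pred_next)  # sum over t replaced by T * E_t[KL], as in eq. 7
```

Note that t = T-1 is a valid draw here: betas[T-1] = 1 makes true_next exactly the one-hot y, and the KL against the network's categorical prediction remains well-defined, which is the point of the answer to Q4 below.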
Q4: What happens to the last timestep in eq 7? This seems to be a Dirac, which probably poses problems. Currently it seems that the final prediction is never evaluated, and we only predict until z_{T-1} instead of matching the final z_T to y. How do you evaluate the final prediction z_T?
As we discuss in the answer to the previous question, for a given t we compute the KL divergence between distributions of z_{t+1}. It means that for t = T-1 we will compute the KL divergence for distributions of the final state z_T. Therefore, for the last timestep t = T-1, we compute D_KL[p(z_T | z_{T-1}, y) || q(z_T | z_{T-1})], which is precisely the divergence between the ground-truth distribution of the final value z_T given the previous value z_{T-1} and target y, and the approximated (by the neural network) distribution of z_T given z_{T-1}. No Dirac functions appear here.
Q5: What is p(z|x,y)?
p(z_t | x, y) is the distribution of the intermediate state z_t at time step t of the Markov bridge that starts at the point x and ends at the point y.
Q6: What is p(z_t+1 | z_0, z_T)? Where does this go? Why do we talk about this? Does this have something do with p(zt | z0, y)?
p(z_{t+1} | z_0, z_T) is the distribution of the intermediate state z_{t+1} at time step t+1 of the Markov bridge that starts at the point z_0 and ends at the point z_T. In the context of the previous question, z_0 is interchangeable with x and z_T is interchangeable with y. We explicitly mention this, saying that z_0 = x and z_T = y, in the paragraph after Equation 9. These variables are introduced to simplify the notation.
References
[1] Vignac, Clement, et al. "Digress: Discrete denoising diffusion for graph generation." arXiv preprint arXiv:2209.14734 (2022).
[2] Anand, Namrata, and Tudor Achim. "Protein structure and sequence generation with equivariant denoising diffusion probabilistic models." arXiv preprint arXiv:2205.15019 (2022).
Thank you for highlighting the strong performance of our method and for the thorough review of our work. We believe the additional baseline methods you proposed are a valuable contribution to the quality of our paper. Below, we address all your comments and questions in detail.
We hope you find our arguments convincing, but should you still have doubts that prevent you from raising your score, we will be happy to engage further during the discussion stage. We will also try our best to incorporate any additional suggestions you might have for further improving our manuscript. We are looking forward to hearing from you.
W1: The method is incremental: the diffusion is directly from Austin, and I think the loss function is also only a variant of diffusion losses. The network is directly from the literature. There are no chemical inductive biases that I can identify.
We respect the opinion of the reviewer on this very subjective matter, but we would like to highlight our contributions that, in our opinion, make our work clearly novel (beyond incremental improvements) and differentiate it from previous generative modeling frameworks. None of the existing approaches, including diffusion models (DM), conditional flow matching (CFM) and diffusion Schrödinger bridge (DSB) models, address the problem of mapping between two intractable, discrete distributions. DMs do not allow for an intractable prior. CFMs do not operate on discrete data and do not optimise p(y|x); DSBs either do not operate on coupled samples or do not work with discrete data. Therefore, to the best of our knowledge the proposed Markov Bridge Model (MBM) is the first method that fulfills all these requirements. We kindly ask the reviewer to provide specific pointers to other works if there is reason to believe our evaluation of the prior art is incomplete.
Apart from this general remark, we address each of your specific concerns below:
- We took the discrete data formalism from Austin et al., as other papers did too [1, 2], but we want to stress that MBM is not a diffusion model. The major differences are that neither a tractable prior nor a forward noising process are required.
- The loss function is a variational lower bound on the data log-likelihood which is common practice in the generative modelling community. Here, we derive the specific form of this lower bound for the proposed Markov bridge model and show empirical advantages over a previously used simplified cross-entropy loss.
- The network is a variable component of our framework that can depend on the task and data structure (e.g. if you apply MBMs to text, you can plug in an LSTM as a backbone model instead). As developing a new type of GNN is not the focus of this work, we took the same architecture as in DiGress which was shown to perform well on molecule-related tasks. We would add that it even makes our comparison with DiGress more fair and highlights our contribution on the methodological side w.r.t. the new MBM framework.
- Finally, we did not use major chemical inductive biases because our work is motivated by the statistics of the retrosynthesis data rather than chemical intuition. The main inductive bias we investigate in this paper is the one-to-many mapping between products and reactants that is naturally captured by our proposed framework. The benefits of this idea become apparent in the round-trip evaluation where RetroBridge shows significant empirical gains compared to baselines.
Thanks for the responses, these did clarify many issues. I'm happy to see more comparisons, and the empirical results have become even more impressive. I agree that there is novelty here. Yet, I think the method is more on the incremental end of the spectrum, and the results are more or less a black box with limited insights into why it works so well or what it has learnt.
Nevertheless, I’m raising my score to 6, and would be happy to see this paper accepted.
The authors introduce a novel approach to retrosynthesis planning by modeling it as a distribution learning problem in a discrete state space. The core of this approach is the Markov Bridge Model, a generative framework designed to approximate the relationship between two intractable discrete distributions. By optimizing a variational lower bound, similar to that in diffusion models, the method learns a Markov process that transitions between the product and reactant distributions. Empirical results suggest that the proposed method surpasses existing ones, achieving state-of-the-art results on standard benchmarks.
Strengths
- The paper offers a novel perspective on retrosynthesis planning, framing it as a distribution learning problem on a discrete distribution instead of a discriminative learning problem. The introduction of the Markov Bridge Model in this context and its comparison with the diffusion model is novel and interesting.
- The presentation of the methodology is clearly written and easy to follow.
- The derivation of the variational objective, which involves maximizing a lower bound of log-likelihood, is theoretically sound.
- The empirical result is competitive against existing one-step retrosynthesis models.
- The proposed method is general and can potentially be extended to other application domains involving the learning of transitions between discrete distributions.
Weaknesses
- The paper does not discuss the time required during the training and inference stages of the proposed method, including the time step T, the number of samples needed to converge on the top-k prediction, etc.
- From the atom/bond level parametrization in Section 3.2, it seems that the required T could be quite large. An ablation study to report the top-k performance vs. different Ts could be very useful.
Questions
- How much more time is needed during the training and inference phase compared with the baselines?
- What's the choice of T in the experiments, and how many samples are needed to obtain the results?
We thank the reviewer for the generally positive review and useful suggestions, which we have incorporated into the updated manuscript. We are happy to hear that the reviewer shares our vision that Markov Bridge Models might find applications in other domains as well.
W1: The paper does not discuss the time required during the training and inference stages of the proposed method, including the time step T, the number of samples needed to converge on the top-k prediction, etc.
We elaborate on this issue in our responses to Questions 1 and 2 below.
W2: From the atom/bond level parametrization in Section 3.2, it seems that the required T could be quite large. An ablation study to report the top-k performance vs. different Ts could be very useful.
We thank the reviewer for this useful remark and provide an ablation study in Appendix A.3.5. As shown in the new Figure 5A, the performance of our model does not change significantly with up to a 2x sampling speed-up (i.e. when sampling with T=250 instead of T=500). The performance starts degrading notably only after a 5-fold reduction in the number of time steps.
Q1: How much more time is needed during the training and inference phase compared with the baselines?
Despite the huge success and wide applicability of diffusion models and similar score-based generative methods, a known bottleneck of such algorithms is sampling. We also cannot avoid iterative sampling, and hence the inference procedure requires hundreds of forward passes. Our model reported in Table 1 takes 1.3 seconds to sample a set of reactants; for comparison, MEGAN and GLN take 0.0064 and 0.0065 seconds, respectively. We note, however, that about one second for sampling a set of reactants for a given product is a completely feasible time for applications of models like ours. In particular, this speed does not prevent the use of our method as a component of a multi-step retrosynthesis planning pipeline.
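For intuition on where the time goes, here is a schematic of the iterative sampling loop as a toy, runnable Python sketch (the dummy model and toy state vector are stand-ins; in RetroBridge each step is one forward pass of a graph transformer):

```python
import numpy as np

def dummy_model(z, t, K=5):
    # stand-in for the trained network predicting per-node transition probabilities
    return np.full((z.size, K), 1.0 / K)

rng = np.random.default_rng(0)
z = np.zeros(8, dtype=int)         # toy "product" with 8 nodes
T = 500                            # number of sampling steps used in our experiments
for t in range(T):                 # T network forward passes per sample,
    probs = dummy_model(z, t)      # which is why sampling takes ~1.3 s
    z = np.array([rng.choice(p.size, p=p) for p in probs])
```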
Regarding training times, we trained our model for 72 hours on a single Tesla V100-PCIE-32GB GPU and selected the best model based on top-5 exact match accuracy on the validation set (see Section A.3.4). We did not use early stopping or perform ablations with earlier checkpoints, since training is only a one-time cost and depends on many unrelated hyperparameters, e.g. the number of validation steps, which impedes comparisons with other methods.
Q2: What's the choice of T in the experiments, and how many samples are needed to obtain the results?
We sample with T=500 for all our experiments, and generate 100 samples per input product. We additionally perform an ablation study on the number of samples (see Appendix A.3.5). As shown in the new Figure 5B, RetroBridge is very robust to the reduction of the number of samples.
Thanks for the reply! Would be helpful to include the answers to Q1 in the paper as well.
Thanks! We added this info in Sections A.3.4 and A.3.5.
This paper proposes a novel single-step retrosynthesis method called RetroBridge, utilizing the Markov Bridge Model. RetroBridge formulates the reactants and products in single-step retrosynthesis as two discrete distributions and learns the dependencies between them. By performing conditional sampling, the generated reactions demonstrate high accuracy.
Strengths
- To the best of my knowledge, this paper is the first to approach the single-step retrosynthesis problem by framing it as a distribution fitting task, akin to a diffusion model.
- The proposed model achieves SOTA top-k accuracy (k > 3) on USPTO-50K.
- The comparison with the diffusion model makes perfect sense.
Weaknesses
- The top 1 accuracy achieved on the USPTO-50K dataset is relatively low, and a similar trend is observed in the forward prediction task. It would be helpful if the authors could provide an explanation for these results.
- Some baseline methods are missing, which may perform much better than RetroBridge, like DualTF, RSMILES, PMSR, etc.
- In comparison to other TF methods, RetroBridge requires atom-to-atom mappings, which could potentially pose limitations on its feasibility. This constraint could become a hindrance when applying RetroBridge to larger datasets such as USPTO-Full.
Questions
This paper provides a fresh perspective on AI for retrosynthesis; however, it could benefit from further elaboration on the chemical aspects involved. Some previous works, like GraphRetro and MEGAN, somehow mimic the process of chemical reactions. So, how does the Markov bridge formulate chemical reactions? I would consider raising my rating if the authors addressed my concerns.
We thank the reviewer for recognizing the strengths of our work while also evaluating it with a critical eye. We are particularly grateful for the list of additional baselines that were missing in our initial submission. We added all the suggestions to our manuscript and address further concerns and questions below. We believe we have improved the quality of our paper and hope to have provided satisfactory answers to each of your questions. If there is anything else we can do for you to consider raising your score, please let us know. We are looking forward to hearing from you.
W1: The top 1 accuracy achieved on the USPTO-50K dataset is relatively low, and a similar trend is observed in the forward prediction task. It would be helpful if the authors could provide an explanation for these results.
Since one product can be obtained through different reaction pathways and from different starting sets of reactants, top-1 accuracy reports nothing but the ability of the model to guess which exact reaction for a given product was patented and therefore included in the dataset. We stress that top-3 and top-5 are much more important metrics, as they reflect the probabilistic aspect of the problem. Besides, as discussed in [1], the typically expected or desired breadth of the multi-step retrosynthesis planning tree is 5, which makes top-5 accuracy the preferred metric. We additionally emphasize that, as shown in Table 3, our model was optimized for top-5 accuracy. We added these remarks in Section 4.3.
[1] Maziarz, Krzysztof, et al. "Re-evaluating Retrosynthesis Algorithms with Syntheseus." arXiv preprint arXiv:2310.19796 (2023).
W2: Some baseline methods are missing, which may perform much better than RetroBridge, like DualTF, RSMILES, PMSR, etc.
We thank the reviewer for these suggestions. We significantly extended Tables 1 and 2 and, in particular, added DualTF to Table 1. We would like to mention, however, that DualTF is an ensemble of 3 models, one of which is a ranking model in the forward direction, which none of the other baseline models use. We would also like to note that the comparison with PMSR and RSMILES is not fair: while all the methods in Table 1 (including ours) are trained on the same subset of USPTO-50k, PMSR is additionally pre-trained on 3.7M reactions from Pistachio, and RSMILES is pre-trained on USPTO-FULL with a 20x augmented test set. We therefore add these methods, for completeness, only to the extended Table 4 in the Appendix, where we provide more information on the technical aspects of the methods.
W3: In comparison to other TF methods, RetroBridge requires atom-to-atom mappings, which could potentially pose limitations on its feasibility. This constraint could become a hindrance when applying RetroBridge to larger datasets such as USPTO-Full.
While RetroBridge uses atom mapping for training, this is not conceptually required. It is true that a training set with all molecules correctly mapped significantly simplifies the training process. However, although we did not perform such experiments, we speculate that the probabilistic nature of our model allows for minor inconsistencies in the atom mapping. With modern high-performance tools for atom mapping such as RXNMapper [1], we believe the level of possible mapping inconsistencies is negligible for the performance of RetroBridge. Also, to the best of our knowledge, no studies in the literature have empirically demonstrated that errors in atom mapping lead to significant practical performance loss for any kind of retrosynthesis model.
[1] Schwaller, Philippe, et al. "Extraction of organic chemistry grammar from unsupervised learning of chemical reactions." Science Advances (2021).
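As a side note on feasibility: atom-mapping an unmapped dataset with RXNMapper is a few lines of code. The snippet below follows the usage example from the RXNMapper README (the reaction SMILES is illustrative):

```python
from rxnmapper import RXNMapper  # pip install rxnmapper

rxn_mapper = RXNMapper()
rxns = ["CC(=O)O.OCC>>CC(=O)OCC"]  # esterification: acetic acid + ethanol -> ethyl acetate
results = rxn_mapper.get_attention_guided_atom_maps(rxns)
print(results[0]["mapped_rxn"], results[0]["confidence"])
```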
Q1: This paper provides a fresh perspective on AI for retrosynthesis; however, it could benefit from further elaboration on the chemical aspects involved. Some previous works, like GraphRetro and MEGAN, somehow mimic the process of chemical reactions. So, how does the Markov bridge formulate chemical reactions? I would consider raising my rating if the authors addressed my concerns.
Indeed, some prior works are inspired by the arrow-pushing formalism from organic chemistry, using discrete graph edits. Here, we instead build on statistical intuition and the fact that each product molecule can be obtained starting from several sets of reactants. We argue that sampling proportional to p(reactants|product) is important to find a diverse set of valid reactants and to provide more good options to a multi-step pipeline. Our framework is designed to approximate this probability, while autoregressive models typically optimize the likelihood of specific sequences of edits, imposing a more or less arbitrary order, even though different sequences can lead to the same outcome. We therefore expected our formalism to be better suited to the task, which might explain the superior performance on the round-trip metrics.
Additionally, MEGAN makes heavy use of chemical input features (atomic number, formal charge, chiral tag, number of explicit hydrogen atoms, aromaticity, bond type, bond stereochemistry) which might provide additional cues to solve the task. We believe it could be an interesting future direction to investigate whether endowing RetroBridge with more chemically relevant input features can further improve its empirical performance.
We thank all reviewers for their generally positive feedback and useful comments. We were excited to see that most of the reviewers find the Markov Bridge formulation conceptually novel and envision future use cases in other domains. The reviewers also highlighted the strong empirical performance of RetroBridge, especially for the most relevant exact match metric, top-5 accuracy [1].
Based on the reviewers’ valuable suggestions, we made a number of improvements to the paper that you can find in the updated PDF. Changes are highlighted with a teal color. More specifically:
- We added more baselines both in Table 1 and in a new extended table in the appendix (Table 4), in which we also provide some methodological details for each model.
- We added new results showing how top-k accuracy varies depending on the number of sampling steps T (Figure 5A).
- We show the dependency of RetroBridge's performance on the number of samples generated per product molecule (Figure 5B).
We tried our best to address remaining questions and concerns in the individual responses to each reviewer below.
[1] Maziarz, Krzysztof, et al. "Re-evaluating Retrosynthesis Algorithms with Syntheseus." arXiv preprint arXiv:2310.19796 (2023).
This paper presents a new Markov Bridge Model, RetroBridge, for single-step retrosynthesis. RetroBridge approximates the relationship between two intractable discrete distributions over products and reactants. Empirical results show that the proposed method achieves state-of-the-art results on standard benchmarks.
Firstly, the method is principled and offers a novel perspective on retrosynthesis planning. Secondly, the experiments clearly demonstrate the effectiveness of the proposed method. The presentation is very clear and easy for readers to follow.
However, the key component comes from existing works with limited additional insights, so its novelty is somewhat incremental. It would be better if the paper studied the method more deeply and provided some explanations from a chemical perspective.
Why not a higher score
Its novelty is somewhat incremental, since the key component comes from existing works with limited insights.
Why not a lower score
The method offers a novel perspective, and works very well.
Accept (spotlight)