PaperHub
Overall rating: 6.8/10 (Poster, 4 reviewers)
Individual ratings: 5, 4, 4, 4 (min 4, max 5, std 0.4)
Confidence: 3.8
Novelty: 3.3 | Quality: 3.0 | Clarity: 2.8 | Significance: 2.8
NeurIPS 2025

Retrosynthesis Planning via Worst-path Policy Optimisation in Tree-structured MDPs

OpenReview | PDF
Submitted: 2025-05-07 · Updated: 2025-10-29
TL;DR

We propose Interactive Retrosynthesis Planning, a novel framework that learns to construct retrosynthetic routes by interacting with tree MDPs and optimising a worst-path objective by self-imitation learning.

Abstract

Keywords
Retrosynthesis planning, reinforcement learning, self-imitation learning

Reviews and Discussion

Official Review
Rating: 5

This paper introduces a new approach to retrosynthesis planning called InterRetro, which builds synthesis routes for molecules by focusing on the weakest part of the route.

The key idea is to ensure that all paths in the synthesis tree end in purchasable compounds, since one failed path makes the whole plan invalid.

InterRetro treats the planning process as a worst-path optimisation problem in a tree-structured Markov Decision Process and improves its decision-making using self-imitation learning.

Strengths and Weaknesses

Strengths:

The paper reframes retrosynthesis planning as worst-path optimisation in tree-structured MDPs, which is both novel and well-justified.

It derives Bellman equations for the worst-path value and proves the existence and uniqueness of the optimal value.

Strong empirical results achieving state-of-the-art performance.

The code is available, and the provided example is reproducible.

Weaknesses:

The quality of the figures does not match the quality of the paper. Some figures (e.g., Figures 2 and 4) show rasterization artifacts when the paper is enlarged to a comfortable reading size, and Figure 4 in particular is difficult to interpret when printed on A4 paper.

The design of Table 1 is confusing: in the text, you state that you compare established retrosynthesis planning methods and list them together, but in the table, they appear in separate columns. I still struggle to understand what is being presented in the table, even after reviewing it several times.

Questions

Out of all the listed contributions, the third one—"we apply this framework to retrosynthesis through InterRetro, a search-free approach to multi-step planning"—is not entirely clear to me. I understand the other three contributions, but this one feels more like an application rather than a contribution in itself. I would appreciate an elaboration on what makes this application a distinct contribution.

In the Tree-structured MDPs section, could you please explain why the domain of the mapping is 2^S?

What does 190 refer to in Retro*? I couldn't find any information about this in the citation link you provided—almost everywhere it's simply referred to as Retro*.

According to the metrics, your results are quite impressive. Could you elaborate on why you think this is the case and what the consequences might be for the field? Should we expect a significant leap forward in retrosynthesis planning, or do these strong metrics not necessarily translate to major changes in industry applications?

After addressing my questions and concerns, I would be willing to reconsider my evaluation.

Thank you for your submission and have a nice day!

Limitations

Yes

Justification for Final Rating

The authors have addressed my concerns, so I am raising my score.

Formatting Concerns

Author Response

Thank you for your detailed feedback and for the time you spent in reviewing our manuscript. We have addressed every point you raised and will incorporate the corresponding changes into the final version.

W1. Thank you for bringing this to our attention. We have regenerated Figure 2 and Figure 4 in PDF format to ensure their readability on different devices. For Figure 2, we enlarged the size of each sub-figure, increased the font size, and ensured that all sub-figures scale cleanly on A4 paper. Because we cannot upload a revised paper during the rebuttal, these improved figures will appear in the final version.

W2. Table 1 shows the results for different combinations of single-step models and search algorithms across three benchmarks. Each row represents a specific combination (e.g., "InterRetro - Retro*" means InterRetro as the single-step policy with Retro* as the search method). The columns then show: (1) "DG" - direct generation without search, and (2) "100", "200", "500" - results with different search budgets (maximum model calls). We will improve the caption and the table accordingly to make them easier to understand.

Q1. Contribution 1 and contribution 2 formulate the problem as a tree-structured MDP with the worst-path objective and propose self-imitation to optimise this objective. However, there is still a gap between this theory and a practical algorithm: we need to decide how the agent interacts with the tree MDP, how it stores past successful experience for self-imitation, and so on. Therefore, we have contribution 3, which extends the theory, adds more implementation details, and solves the retrosynthesis problem in practice. The engineering decisions and implementation details that enable this capability constitute a significant contribution beyond the theoretical framework.

Q2. The notation $2^{\mathcal{S}}$ denotes the power set of $\mathcal{S}$, which is the set of all possible subsets of $\mathcal{S}$. We use this because in retrosynthesis, applying a reaction to a molecule produces a set of reactant molecules (typically 1-3 reactants), not just a single molecule.

For example, if molecule $s$ undergoes reaction $a$, it might produce two reactants $\{s_1, s_2\}$. Since $\{s_1, s_2\}$ is a subset of the set of all possible molecules $\mathcal{S}$, we have $\{s_1, s_2\} \in 2^{\mathcal{S}}$. This is why our transition function is defined as $\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow 2^{\mathcal{S}}$ rather than $\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$. We have added this clarification to Section 3.1 in the revised manuscript.
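To illustrate the point, a minimal sketch follows, assuming a hypothetical `single_step_model` object with a `predict_reactants` method (not the actual InterRetro code): the transition returns a set of reactants rather than a single successor state.

```python
# Illustration of a deterministic tree-MDP transition mapping into the
# power set 2^S: applying one reaction to a molecule yields a *set* of
# reactant molecules. `single_step_model` is a hypothetical stand-in.
from typing import Set

State = str    # a molecule, e.g. encoded as a SMILES string
Action = int   # index of a predicted reaction / edit sequence

def transition(s: State, a: Action, single_step_model) -> Set[State]:
    """T: S x A -> 2^S; the reactant set obtained by applying reaction a to s."""
    reactants = single_step_model.predict_reactants(s, a)  # e.g. [s1, s2]
    return set(reactants)
```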

Q3. The Retro*-190 in our paper refers to the dataset from Retro*, as described in Section 5.1 of Chen et al. (2020). 190 is the number of target molecules included in the dataset. The dataset can be downloaded from Retro*'s GitHub page. We refer to this dataset as Retro*-190 rather than simply Retro* to keep the suffix indicating the number of test molecules in the dataset, consistent with ChEMBL-1000 and GDB17-1000.

Q4. Our impressive results stem from two key innovations: (1) the worst-path objective that directly optimises for complete route validity, and (2) the self-imitation learning that effectively leverages successful experiences. For the research community, these results demonstrate that search-free retrosynthesis is not only possible but can outperform search-based methods. This could redirect research efforts toward end-to-end learning approaches. For practical applications, the search-free nature offers immediate benefits: real-time route generation for large molecular libraries, integration into interactive design tools, and reduced computational infrastructure requirements.

However, we must acknowledge important limitations. Our approach, like all current methods, assumes predicted reactions are experimentally feasible — an assumption that doesn't always hold. Real-world deployment would require integration with reaction feasibility predictors and experimental validation workflows. Additionally, our method currently optimises for route depth rather than other practical considerations like cost and reaction conditions.

We see our work as a significant step toward practical automated synthesis planning, but achieving industry-ready systems will require addressing these additional challenges. The impressive metrics indicate that the core planning problem is becoming tractable, allowing the field to focus on these remaining practical considerations.


Finally, we thank you again for raising important questions about our submission. Have we adequately addressed the main concerns? Please feel free to let us know if there are additional concerns or questions. Have a nice day!

Reference

Chen, Binghong, et al. Retro*: Learning retrosynthetic planning with neural guided A* search. ICML. 2020.

Comment

Thank you for the clear and thoughtful rebuttal. Your clarifications effectively address all concerns raised, including figure readability, table interpretation, theoretical contributions, notation, dataset naming, and practical implications. No further concerns.

Official Review
Rating: 4

This paper formulates retrosynthesis as a worst-path optimization problem within tree-structured Markov Decision Processes, and proposes InterRetro, a method that interacts with the tree MDP, learns a value function for worst-path outcomes, and improves its policy through self-imitation. Experimental results show that InterRetro achieves state-of-the-art performance, solving 100% of the targets on the Retro*-190 benchmark, and also demonstrates strong results on two additional test sets.

Strengths and Weaknesses

Strengths:

  1. The paper is well-written and presents a comprehensive and complete study.
  2. The idea of worst-path optimization is quite interesting. It offers a new perspective for addressing the retrosynthesis planning problem.
  3. The experimental results are compelling and demonstrate significant improvements compared to prior algorithms.

Weaknesses:

  1. The clarity of some descriptions could be improved. For example, the deterministic transition function is a mapping from (state, action) to a set of reactant molecules. Each molecule corresponds to a single state, so transitioning from one state to a set of states is not a well-defined formulation.

  2. The experimental part needs to be supplemented with more details. For specific points, please refer to the Questions part.

  3. The significance of the paper needs to be further elaborated. In retrosynthesis planning, real-time feedback is not typically required, and taking several minutes to find a feasible solution is generally acceptable. The quality of the solution is more critical. Therefore, the claim that time-consuming real-time search limits practical utility in large-scale molecular design scenarios warrants further scrutiny.

Questions

  1. The paper mentions hundreds or thousands of model calls per molecule; however, previous retrosynthesis algorithms tested on the Retro*-190 dataset achieve a success rate of 99% with an average number of calls fewer than 50 [1] [2], and significantly reducing the number of model calls can also be achieved by caching the results of previous single-step retrosynthesis model calls to avoid redundant computations. Please kindly verify this point again. [1] Yu, Y. et al. GRASP: Navigating retrosynthetic planning with goal-driven policy. Advances in Neural Information Processing Systems (NeurIPS 2022). [2] Xie, S. et al. RetroGraph: Retrosynthetic planning with graph search. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2120–2129 (ACM, 2022).

  2. $\gamma^T$ is employed to penalize deeper decompositions; however, if the target molecule requires a relatively long synthesis pathway, could this negatively impact the search efficiency? Specifically, might the search become biased towards shorter paths and spend excessive effort exploring those regions?

  3. Graph2Edits serves as the single-step retrosynthesis model in your approach. Could you please explain how it is combined or integrated with the trained policy during the planning or search process?

  4. In Table 1, the success rate of Retro* with a template-based single-step model is 75.26% when the number of model calls is limited to 500, which differs from the result reported in the Retro* paper. Could you please provide an explanation?

  5. What is the average number of model calls for the algorithm? Could this information be included in a table for clearer comparison?

  6. This paper optimizes for the shortest path to generate feasible synthesis routes. However, retrosynthesis involves various optimization objectives, such as the cost of chemical reactions. For example, Retro* approximates cost optimization by using the negative log of the policy function. How does the algorithm proposed in this paper incorporate such diverse optimization objectives?

If the authors can address the question I raised above, I would be willing to increase my score.

Limitations

The main text of the paper does not discuss limitations or potential negative societal impacts; these aspects are only addressed in the appendix.

Justification for Final Rating

After discussing with the authors, I have a better understanding of the training process and the motivation for the shortest path. However, I still have reservations about the significance of the problem itself. I have adjusted my score to 4, while remaining neutral on whether the paper should be accepted.

Formatting Concerns

The guideline for each question in the checklist should not be removed. There are no other formatting problems in this paper.

Author Response

Thank you sincerely for your professional and constructive feedback. We are pleased to receive your comments, which have helped us improve the manuscript significantly. We have analysed and addressed each point below, and we hope our responses satisfactorily resolve your concerns.

W1. In line 141, we define the transition function as a mapping from the state–action space to the power set of the state space, $\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow 2^{\mathcal{S}}$, in line with your suggestion. We have revised the presentation to improve clarity.

W2. We have revised the experimental section accordingly; please see our response to the Questions below.

W3. We agree that the quality of the synthetic route is essential in experimental settings. However, our emphasis on computational efficiency reflects the growing demand for scalable retrosynthesis planning, not a compromise on quality. In AI-guided drug discovery, retrosynthesis planning is increasingly deployed in batch settings to evaluate large numbers of candidate molecules generated by models such as LLMs. These pipelines often screen tens of thousands of compounds, where each must be filtered or ranked based on synthesizability. In such settings, latency per molecule quickly becomes a bottleneck. Furthermore, human-in-the-loop platforms (e.g., IBM RXN and ASKCOS) allow chemists to interactively sketch molecules and receive immediate feedback on the molecule's synthesizability. These real-time applications require short latency, enabling seamless user interaction and rapid design iteration. Such fast feedback is critical in practical workflows, where chemists often explore many variants of a lead compound in a single session. Therefore, in modern retrosynthesis applications, short inference times are not merely desirable but often necessary. When route quality is comparable, computational efficiency becomes the key to making retrosynthesis planning actionable at scale.

Q1. The heuristic search algorithms mentioned in the background paragraph require substantial model calls: Retro* typically exceeds 150, MCTS around 250, and Greedy DFS around 300. More details are provided in our response to Question 5. We agree that describing them as "thousands" is inaccurate; to clarify this, we have updated the text to: “These methods heavily rely on time-consuming real-time search, often requiring hundreds of model calls for each molecule...”

Q2. Thank you for highlighting this concern. The $\gamma^T$ term serves only to discourage unnecessarily deep decompositions: its effect is relative to the alternatives available for a given molecule. A short path that fails to reach purchasable building blocks receives a return of 0, whereas a longer but successful path still earns a positive return $\gamma^T > 0$; consequently, our policy will always favour the viable route, even if it is longer. Among successful routes, $\gamma^T$ then biases the planner toward the shallower option, cutting out extra steps while keeping the route feasible. Moreover, because the worst-path objective scores each synthesis tree by its weakest branch, any tree containing a dead-end is automatically rejected, preventing the search from being “trapped” in short but invalid regions. Therefore, the depth penalty prunes gratuitous exploration while still selecting longer routes whenever they are actually required.
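To make this concrete, a minimal sketch of the worst-path scoring follows, assuming a hypothetical `SynTreeNode` structure rather than our actual implementation: a dead-end branch yields a return of 0, while a deeper but fully purchasable branch still yields $\gamma^T > 0$.

```python
# Sketch of worst-path scoring (illustrative only): a synthesis tree is
# scored by its weakest branch, so any dead-end gives 0, while a longer
# but fully purchasable branch still earns gamma**T > 0.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SynTreeNode:              # hypothetical structure for illustration
    purchasable: bool
    children: List["SynTreeNode"] = field(default_factory=list)

def worst_path_return(node: SynTreeNode, gamma: float = 0.9, depth: int = 0) -> float:
    if not node.children:       # leaf: purchasable building block or dead end
        return gamma ** depth if node.purchasable else 0.0
    return min(worst_path_return(c, gamma, depth + 1) for c in node.children)

# Example: a 2-step dead-end scores 0.0, while a 5-step route whose leaves
# are all purchasable scores 0.9**5 ≈ 0.59, so the viable route is preferred.
```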

Q3. In InterRetro, we initialise a pretrained Graph2Edits model ($\pi_\theta$ in Algorithm 1) and finetune it using self-imitation learning until convergence. During the search process, this finetuned Graph2Edits model is used as the one-step expansion model to generate reactants. We do not combine or integrate it with any other trained policy. The retrosynthesis search process, like Retro* (Chen et al., 2020) and MCTS (Hong et al., 2023), typically consists of three steps: selection, expansion, and update. Our finetuned Graph2Edits model is used exclusively in the expansion step, where it predicts a sequence of edits to decompose a molecule into its reactants.

Q4. Thank you for pointing this out. We mistakenly reported the results for a model-call budget of 400 (Liu et al. 2023). We have now thoroughly reviewed all other reported results to ensure their accuracy.

Q5. Thank you for your suggestion. Below, we compare our method with the suggested RetroGraph and other relevant algorithms:

Method | Avg. Model Calls
Greedy DFS | 300.56
MCTS | 254.32
Retro* | 209.86
RetroGraph | 45.13
PDVN-Retro* | 30.94
InterRetro-Retro* | 30.83

To compute the average model calls, we set the maximum model-call budget to 500 and measured the average number of calls required by each algorithm to successfully construct a synthesis route for molecules in the Retro*-190 dataset. We find that InterRetro and PDVN both achieve state-of-the-art performance in terms of model-call efficiency. These results will be included in the revised version of the paper.

Q6. We acknowledge the importance of a retrosynthesis planning algorithm supporting various optimisation objectives. In that case, we could incorporate new quantities into the weighting function and encourage the policy to learn low-cost reactions. The cost of chemical reactions may come from the reaction conditions, the price of the reactants, etc. To the best of our knowledge, a comprehensive dataset capturing such information is not yet available. We would be happy to extend InterRetro in this direction once the necessary infrastructure and data become available.

Limitations. In the revised version, we have moved the discussion of limitations and potential negative societal impacts from the Appendix into the last section of the main text.

Paper Formatting Concerns. We have added the guidelines in the revised version.


We would like to thank you again for raising important questions about our submission. Due to this year's rebuttal policy, we cannot upload a revised PDF to reflect these changes. However, we believe that after incorporating the valuable feedback our current version becomes much stronger. Have we sufficiently addressed the main concerns? Please feel free to let us know if there are additional concerns or questions.

References

Chen, Binghong, et al. Retro*: Learning retrosynthetic planning with neural guided A* search. ICML, 2020.

Hong, Siqi, et al. Retrosynthetic planning with experience-guided Monte Carlo tree search. Communications Chemistry, 2023.

Liu, Guoqing, et al. Retrosynthetic planning with dual value networks. ICML, 2023.

Comment

Thank you for the response. I have two remaining questions concerning the paper.

  1. In the answer to Q2, I understand that unsuccessful shorter paths are discarded and longer but feasible solutions can be identified. My concern lies in the number of infeasible shorter paths explored prior to identifying a longer feasible one, as this directly relates to the search efficiency of the algorithm. Taking the extreme case of breadth-first search as an example, although the shortest successful path can be found, all paths shorter than this successful one must be expanded earlier, leading to very low efficiency.

  2. I have additional points regarding the significance of the paper that I would like to discuss with the authors. In human-in-the-loop platforms, what is the relative time spent on laboratory experiments versus retrosynthesis planning algorithms, and would efficiency gains in the latter yield a significant impact on the overall workflow time? Retrosynthesis algorithms usually require only a few minutes to execute, and in my view, further optimization of this runtime may be less impactful relative to the time consumed by chemical experiments. As for the AI-guided drug discovery, a model is usually trained to predict synthesizability scores without necessarily identifying an explicit feasible synthesis route. Retrosynthesis planning is performed only after filtering candidates using the synthesizability scoring model and other metrics.

Comment

Thank you for bringing up these concerns. We’re glad to have the chance to address them in this discussion.

Q1. During search, each step in an unsuccessful path cannot receive any informative reward, offering little guidance and making the search inefficient. This issue is common in retrosynthesis search algorithms such as MCTS and Retro*. For example, in PDVN-MCTS, each step in an unsuccessful path is assigned a reward of $-1$, which provides insufficient learning signal and tends to prioritise shorter paths. To address this inefficiency, we train a stronger one-step model to guide node expansion more effectively and narrow the search space. In our experiments, InterRetro expands each molecule with only 10 plausible actions, compared to 50 in template-based methods such as PDVN. As a result, the search tree generated by InterRetro is significantly smaller, which directly improves efficiency by avoiding unnecessary exploration of shallow but infeasible paths.

Q2. We provide our responses to the two questions as follows:

Human-in-the-loop platform. We would like to clarify that a human-in-the-loop platform allows chemists to interactively sketch molecules and receive immediate feedback on the molecule's synthetic routes. This workflow typically involves three steps: (1) drawing or editing a target (or intermediate) molecule on screen; (2) receiving a newly suggested full synthetic route; and (3) accepting, rejecting, or further modifying nodes in the synthetic tree. Our proposed method accelerates the second step — suggesting a new route almost instantly, thereby keeping the chemists’ creative flow uninterrupted.

For your question, we agree that laboratory experiments can take days to weeks, which is significantly longer than the time required by retrosynthesis planning algorithms. However, improving the efficiency of the overall design-and-experiment cycle is a systemic challenge that extends beyond retrosynthesis planning alone and is not the primary objective of our proposed algorithm.

Synthesizability score. We agree that synthesizability scoring models are valuable for fast inference. However, quantifying synthesizability by explicitly examining synthetic routes remains an important and complementary approach (Liu et al., 2017; Parrot et al., 2023). In particular, Skoraczyński et al. (2023) investigated whether synthetic accessibility scores can reliably predict the outcomes of retrosynthesis planning. The best-performing methods in their study achieved an AUC of 0.90 and an accuracy of 0.81, indicating that many ponential molecules could still be missed during screening. Given InterRetro’s efficient and realistic route generation, we believe it can contribute meaningfully to more accurate synthesizability evaluation.


Thank you for attending the discussion. We’d be grateful for any further thoughts or feedback you might share.

References

Liu, Bowen, et al. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Central Science, 2017.

Skoraczyński, Grzegorz, et al. Critical assessment of synthetic accessibility scores in computer-assisted synthesis planning. Journal of Cheminformatics, 2023.

Parrot, Maud, et al. Integrating synthetic accessibility with AI-based generative drug design. Journal of Cheminformatics, 2023.

Comment

Thank you for your response. In Algorithm 1, the non-building blocks are added to $q$, and in each iteration a molecule is popped from $q$. Since each molecule has 10 plausible actions, each expansion generates at least 10 non-building blocks that require further expansion. If max_steps is set to 500, the maximum depth of the search tree in this case is 4. Under such circumstances, the feasible solution paths in the training set will not be very long. Given that some molecules in USPTO-190 require at least ten steps to synthesize, with a breadth-first expansion strategy in Algorithm 1, such molecules would require on the order of $10^9$ expansions to be successfully synthesized during training. How is this issue addressed? If the training set does not contain molecules with relatively long synthesis paths, would the learned guiding function mislead the search when synthesizing such long-path molecules in the test phase, thereby reducing the search efficiency?

Comment

Thank you for your detailed question. You have identified a crucial point about our training procedure in Algorithm 1 that we are happy to clarify. Your analysis of the potential for combinatorial explosion is correct if the algorithm were performing a breadth-first search, but our training-time exploration works differently.

The confusion arises from a key distinction between our data generation procedure for training (EXPLORE) and the search algorithms (like MCTS/Retro*) used for inference.

Training-time exploration is greedy, not breadth-first. As you noted, non-building blocks are added to $q$. However, when a molecule $s$ is popped from $q$, we do not expand all plausible actions. Instead, as shown in Algorithm 1 (Lines 6-7), the current policy $\pi_\theta$ is used to sample only one action ($a \sim \pi_\theta(\cdot \mid s)$). The tree is expanded with just this single reaction. This creates a sequential, greedy construction of a single synthetic route at a time, not a branching search tree. The computational cost is therefore linear with the number of decomposition steps, completely avoiding the exponential explosion you described.
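As a compact illustration (a paraphrase with placeholder method names such as `sample_action` and `apply`, not Algorithm 1 verbatim), the training-time rollout looks roughly like this:

```python
# Greedy training-time rollout: each popped molecule is expanded with a
# single sampled action, so the cost grows linearly with the number of
# decomposition steps. The methods on `policy` are placeholders.
from collections import deque

def explore(target, policy, building_blocks, max_steps=500):
    decisions = []                 # (molecule, action, reactants) tuples
    frontier = deque([target])     # open molecules not yet purchasable
    for _ in range(max_steps):
        if not frontier:
            return decisions, True          # every branch reached the stock
        s = frontier.popleft()
        a = policy.sample_action(s)         # a ~ pi_theta(. | s): ONE action
        reactants = policy.apply(s, a)      # set of reactant molecules
        decisions.append((s, a, reactants))
        frontier.extend(r for r in reactants if r not in building_blocks)
    return decisions, False        # budget exhausted; partially built route
```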

Clarification on "10 plausible actions". The "10 plausible actions" figure you mentioned (from our other comments) refers to the search-based inference at test time, which is a standard approach to improve success rates. This does not apply to the EXPLORE procedure used for data generation during training, which, as stated above, only samples a single action per step.

How long synthesis paths are learned. This leads to your final question: if the exploration is greedy, how can the model learn long, complex routes? The key is the iterative nature of the InterRetro algorithm.

  1. Initially, the policy is weak and may only generate short, successful sub-trees by chance.
  2. The algorithm identifies these successful sub-trees within the partially built routes and stores the decisions made along those paths in the replay buffer.
  3. The policy is then updated by imitating these successful past decisions. This is a bootstrapping process: by learning to solve simple (short) parts of a synthesis, the policy becomes progressively more effective.
  4. In later training iterations, this stronger policy is much more likely to generate a longer sequence of correct actions, allowing it to discover and learn from complete, multi-step syntheses that were unreachable initially.

We hope this explanation clarifies how our training procedure remains computationally feasible while still enabling the model to learn the long and complex synthetic routes necessary for challenging targets. Thank you again for your diligence and insightful questions.

Comment

Thank you for your reply. I don't have any new questions

Comment

Dear Reviewer sMtG,

We hope our responses have clarified the issue and addressed your concern. We kindly ask if you would consider updating the rating. Any further input would be greatly appreciated.

Comment

I don't have any further questions. I will revise the score accordingly.

Comment

Thank you for attending the discussion and providing valuable suggestions. We will revise the submission to clarify the points we discussed in the thread. We truly appreciate your support.

Official Review
Rating: 4

Here the authors introduce a new method (InterRetro) for multi-step retrosynthesis planning by introducing a worst-path objective that aims to improve the most infeasible path in a synthetic route. Overall I think the idea is strong and novel, and the paper is nice and well-written and could be well-suited to this venue, but IMO the main limitation holding it back right now is that it is missing comparisons to actual SOTA baselines for this task (e.g., AiZynthFinder or ASKCOS) and instead compares against models that are not the leading ones for multi-step retrosynthesis prediction (while claiming SOTA). This would need to be improved for me to recommend the paper for acceptance.

Strengths and Weaknesses

  • Code available (and lots of routes provided as supplementary material, a bit strange), which is a strong point, but the documentation on how to use the code, packages/environment, etc. is lacking. This could be improved to increase the utility of the model.
  • The paper seems to compare to neither AiZynth nor ASKCOS, which IMO are the leading models for retrosynthesis prediction, especially when it comes to multi-step retrosynthesis planning.
  • I like the results illustrating the performance under different budgets (e.g., Figure 4a), it would be nice to also know how the number of calls compares to compute (e.g., if this was done on a single GPU, etc), and how this compares to other methods.
  • Limited details on the dataset used for training (and how evaluations were done), this should be improved. It is not sufficient to simply say USPTO-50k, since how the splits were created for training/validation/testing are also important. With this in mind, I would encourage the authors to look into the PaRoutes (https://doi.org/10.1039/D2DD00015F) dataset which has been nicely curated for this task.
  • Similarly, on the molecules selected for evaluation from the three different datasets (e.g., Retro*-190, ChEMBL-1000, GDB17-1000) not much is said, some justification needs to be provided here. For instance, it is not clear to me why GDB17 was selected at all, as these molecules to my knowledge are not particularly drug-like or representative of a chemical space in either drug discovery or materials science, so unclear why they are used to evaluate a synthesis tool (do not get me wrong, I think there is value in GDB17, just not for this task).

Questions

  • How do the depths of the predicted synthesis trees compare to the real synthesis trees in the training data?
  • The authors mention that their models are better at generating shorter routes, which is good, but how good is the model at also predicting known routes? This is not something that has been evaluated in the paper and I think it is important (it is also separate from the success rate).
  • How was data selected for the dataset usage experiments (e.g., 1%, 5%, etc.)? Was this done randomly, and does this make sense?
  • How can it be that sometimes the single-step model is better than the multi-step model (e.g., for limited model calls)? I feel like this could be better explained because to me this is an unexpected result.

Limitations

Comparison to SOTA is not well done (see comments above).

Justification for Final Rating

I have increased my score by 1 pt following the authors' thorough and thoughtful response to my recommended revisions surrounding baselines.

Formatting Concerns

n/a

Author Response

Thank you for recognising the novelty of our idea and the quality of our presentation. We have carefully addressed your concerns point by point and revised the manuscript accordingly. We hope that our responses satisfactorily resolve the issues raised.

W1. The routes provided in the supplementary materials are the synthetic routes generated by InterRetro on the Retro*-190 dataset, as stated in Section 5.1. We have improved the documentation and updated the anonymous repository, including a requirements.txt file and a step-by-step usage example. We believe that incorporating your suggestions enhances the reproducibility of our work.

W2. Thank you for raising the important concerns regarding baselines. AiZynth (from AstraZeneca) and ASKCOS (from MIT) are software platforms that allow users to deploy their own retrosynthesis algorithms via a user-friendly interface; they do not introduce new algorithms themselves. By default, both platforms implement a template-based search algorithm, which corresponds to the Template-MCTS baseline included in our original manuscript (Table 1). InterRetro significantly outperforms this baseline (20.00% vs. 95.78% success rate in Direct Generation, and 62.63% vs. 100.00% success rate when the model call budget is 500).

To further address the concern and strengthen our SOTA claim, we have run more experiments comparing InterRetro against two recent compelling algorithms: DreamRetroer (Zhang et al., 2025) and RetroCaptioner (Liu et al., 2024). As shown in the following table, InterRetro demonstrates superior performance, particularly in the search-free Direct Generation (DG) setting. We believe this new comparison reinforces InterRetro's state-of-the-art status.

Method | Direct Generation | Model Call 100 | Model Call 200 | Model Call 500
DreamRetroer | 32.10 | 91.57 | 94.73 | 95.26
RetroCaptioner | 5.26 | 68.94 | 72.63 | 85.26
InterRetro | 95.78 | 96.31 | 100.00 | 100.00

Due to the limited rebuttal period, we have focused on the Retro*-190 benchmark. Results for ChEMBL-1000 and GDB17-1000 will be included in the final version of the paper.

W3. A key advantage of our method is its computational efficiency at inference time. We benchmarked the wall-clock time on a single NVIDIA RTX A5000 GPU. Our results show that Direct Generation takes approximately 0.88 seconds per molecule, while a search with a 500-model-call budget requires 8.53 seconds. The total wall-clock time required to solve the 190 molecules in the Retro*-190 dataset is presented below, demonstrating that InterRetro solves target molecules much faster than recent algorithms.

Methods | Direct Generation | Model Call 100 | Model Call 200 | Model Call 500
DreamRetroer | 67 min | 95 min | 104 min | 116 min
RetroCaptioner | 74 min | 134 min | 141 min | 240 min
InterRetro | 3 min | 16 min | 20 min | 27 min

W4. Our training and evaluation protocol follows current methods such as PDVN (Liu et al., 2023) and Retro* (Chen et al., 2020). The training set provides target molecules for the model to learn how to construct valid synthetic trees. The trained models are then evaluated on the test set based on their success rate in constructing valid trees and the length of those routes. The training set consists of the shortest routes synthesizable using USPTO-50k reactions and eMolecules building blocks. The test set is constructed by extending route extraction, ensuring no overlap with the training set, and applying additional filtering to make the test set more challenging. More details can be found in Section 5.1.2 of Chen et al. (2020).

Thank you for recommending the PaRoutes dataset. We have tested InterRetro on the set-n1 and set-n5. Due to the large size of each set (10,000 molecules), we evaluated only the first 200 molecules in each. As shown in the table below, InterRetro shows consistently strong performance, as it does on Retro*-190, ChEMBL-1000, and GDB17-1000. This supports the objectivity and robustness of our chosen benchmarks.

Route Set | Methods | Direct Generation | Model Call 100 | Model Call 200 | Model Call 500
set-n1 | InterRetro-Retro* | 91.50 | 94.50 | 98.50 | 99.00
set-n5 | InterRetro-Retro* | 94.00 | 97.00 | 100.00 | 100.00

W5. We believe that testing on multiple molecule datasets is helpful to comprehensively evaluate each algorithm, thus we choose multiple distinct benchmarks - Retro*-190, ChEMBL-1000 and GDB17-1000. Retro*-190 dataset serves as the standard benchmark in the field, enabling direct comparison with prior work. ChEMBL-1000 from ChEMBL dataset (Zdrazil et al., 2023) is a manually curated collection of bioactive molecules exhibiting drug-like properties, making it an appropriate benchmark for retrosynthesis planning. Regarding GDB17-1000 (Ruddigkeit et al., 2012), it covers a size range common for typical drugs and lead compounds, featuring unusual ring systems and connectivity patterns that challenge generalisation capabilities. If a method performs well on both drug-like molecules (ChEMBL) and structurally diverse, synthetically challenging molecules (GDB17), it demonstrates strong generalisation beyond memorisation of common reaction patterns. Thus we believe that these three datasets are appropriate benchmarks for retrosynthesis planning. We have added this justification to Section 5.1 to clarify our evaluation strategy.

Q1. Our analysis on the Retro*-190 benchmark shows that InterRetro tends to find more direct routes than those in the dataset. Specifically, the valid routes from our search-free Direct Generation (DG) model have an average depth of 4.54, which is shorter than the dataset's average of 6.96. As shown in the table below, this trend of finding valid routes continues when a search budget is applied. Unfortunately, we cannot perform a similar analysis for the ChEMBL-1000 and GDB17-1000 benchmarks as they do not provide ground-truth synthesis routes.

Methods | Average Depth
Dataset Routes | 6.96
InterRetro (DG) | 4.54
InterRetro (100 calls) | 3.21
InterRetro (200 calls) | 3.03

Q2. While our primary objective is to discover valid synthetic routes, we agree that evaluating the model's ability to reproduce known routes is an important indicator of its feasibility. When planning with Retro* and a maximum model-call budget of 500, 38.9% of the known routes appear in the search tree constructed by InterRetro, while 46.4% appear in the search tree built by standard Graph2Edits. Although slightly lower than standard Graph2Edits, our finetuned model is still effective at recovering the routes present in the dataset. In our experiments, InterRetro can produce valid but genuinely different synthetic strategies, often shorter than the routes present in the dataset (as shown in the supplementary materials). This suggests our model balances learning from established chemistry with discovering potentially more efficient alternatives.

Q3. For the dataset usage experiments shown in Figure 4a, subsets of molecules were selected via random sampling from the USPTO-50k training data. To ensure robust results for each sampling ratio (1%, 2%, 5%, and 10%), we created three independent, randomly sampled subsets and trained a separate model on each. The performance reported in the figure is the average across these three runs. This process ensures our results are objective and robust to sampling variations.

Q4. We appreciate the reviewer pointing out this interesting result, which is indeed reasonable on closer inspection. The Direct Generation (DG) model is entirely focused: it makes a single, decisive prediction for the best reaction at each step. In contrast, a search algorithm like Retro* or MCTS expands a much wider tree by considering multiple (e.g., 10) possible reactions at each step. When the search budget (i.e., the number of model calls) is severely limited, the algorithm cannot explore these numerous branches deeply enough to reliably identify the optimal path, increasing the chance of selecting a suboptimal action compared to the focused DG approach.


We would like to thank you again for raising important questions about our submission. After incorporating this valuable feedback, we believe that our current version becomes much stronger. Have we sufficiently addressed the main concerns? Please feel free to let us know if there are additional concerns or questions.

References

Chen, Binghong, et al. Retro*: Learning retrosynthetic planning with neural guided A* search. ICML, 2020.

Liu, Guoqing, et al. Retrosynthetic planning with dual value networks. ICML, 2023.

Liu, Xiaoyi, et al. RetroCaptioner: Beyond attention in end-to-end retrosynthesis transformer via contrastively captioned learnable graph representation. Bioinformatics, 2024.

Ruddigkeit, Lars, et al. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling, 2012.

Zdrazil, Barbara, et al. The ChEMBL database in 2023: A drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Research, 2023.

Zhang, Xuefeng, et al. A data-driven group retrosynthesis planning model inspired by neurosymbolic programming. Nature Communications, 2025.

Comment

Dear Reviewer paFC,

Thank you for your valuable suggestions on our manuscript. We have carefully addressed your concerns and incorporated your feedback into the revised version. We kindly ask if you would consider updating your review score in light of these changes. Any further input would be greatly appreciated.

Comment

Thank you to the authors for the thoughtful revision.

However, I disagree with the following claim regarding current baselines like AiZynth/ASKCOS and why they have not been benchmarked against:

By default, both platforms implement a template-based search algorithm, which corresponds to the Template-MCTS baseline included in our original manuscript (Table 1). InterRetro significantly outperforms this baseline (20.00% vs. 95.78% success rate

The success rate is much higher than this for AiZynth/ASKCOS, as I know from (very recent) personal experience with these models since we have benchmarked recent methods from my team against these. Even in the 2020 paper, AiZynthFinder reports a much higher success rate for both ASKCOS and AiZynth: https://chemrxiv.org/engage/chemrxiv/article-details/60c74c60bdbb891499a39777

Am I misunderstanding something?

Comment

Thank you for your thoughtful follow-up and for prompting this important discussion on baselines. You are correct that modern retrosynthesis platforms can achieve high success rates. We believe the key reference for this comparison is the 2020 AiZynthFinder study by Genheden et al.

In that paper, the authors perform a "rough baseline comparison" of their open-source AiZynthFinder against the public ASKCOS web server. The results are highly relevant:

  • ASKCOS, using the extensive, commercial Reaxys reaction database and a commercial stock from Sigma/eMolecules, achieved a 62% success rate.
  • AiZynthFinder, using the public USPTO reaction database and a stock derived from the ZINC database, achieved a 54% success rate.

This context is crucial for understanding our work. Our Template-MCTS baseline uses the same public USPTO dataset. With a 500-model-call budget, it achieves a 62.63% success rate on the Retro*-190 benchmark. This result is remarkably consistent with the success rates of 54% and 62% reported in that paper using public data under similar search conditions, which validates our baseline as a representative and fair implementation.

The 20% figure you noted corresponds to this baseline's performance in the much more challenging Direct Generation (search-free) setting, which naturally has a lower success rate.

The most critical point, however, is the performance improvement our learning framework delivers. When we apply our InterRetro training framework to a strong, template-free model like Graph2Edits, we see a dramatic performance lift. The Graph2Edits model achieves a 56.31% success rate on its own when paired with the Retro* search algorithm. After being fine-tuned with our worst-path objective, its performance is elevated to 100% under the same search budget. To further solidify our SOTA claim against the most current academic research, we also benchmarked against recent algorithms like DreamRetroer (2025) and RetroCaptioner (2024), where InterRetro again demonstrates superior performance.

Thank you again for ensuring this level of precision. We will revise the manuscript to include this specific context and clarify how our baseline's performance aligns with the findings in the AiZynthFinder paper.

Comment

Dear Reviewer paFC,

We hope our responses have helped clarify the issue regarding the success rate and have adequately addressed your concerns. We kindly ask if you could consider updating your final rating. Please feel free to let us know if you have any further questions.

Official Review
Rating: 4

This paper introduces InterRetro, a novel approach that frames multi-step retrosynthesis planning as a worst-path optimization in tree-structured MDP. Unlike traditional methods that optimize for average or cumulative rewards, InterRetro focuses on maximizing the value of the most challenging path, ensuring all branches terminate successfully.

Strengths and Weaknesses

Strengths: The proposed worst-path optimization is intuitive while achieving strong empirical results. Besides, the authors provide rigorous theoretical guarantees, including proofs for the existence of a unique optimal solution and assurances of monotonic policy improvement with their weighted self-imitation learning algorithm.

Weaknesses: My main concern regarding this paper is which component contributes to the empirical performance improvement. While InterRetro proposes a novel objective and training methodology, it utilizes Graph2Edits as its underlying single-step model. Graph2Edits is a high-performing single-step retrosynthesis model itself, and its inherent strength might contribute significantly to the overall success. Without an ablation study specifically varying the single-step model while keeping the worst-path objective and self-imitation learning algorithm consistent, it is difficult to fully ascertain whether the improvements are primarily from the novel objective and learning paradigm or from the strong Graph2Edits base.

Questions

  1. What is the insight for choosing Graph2Edits as the single-step model?
  2. Can you ablate your method by replacing the single-step model with the pre-trained MLP from retro*? This can help understand the relative importance of the design made in the paper.

Limitations

yes

Justification for Final Rating

In my initial review, I noted that the source of the performance improvement was unclear. The authors' rebuttal addressed this, clarifying that the gains are primarily attributed to the single-step model rather than the worst-case objective. As the authors have agreed to incorporate this important clarification into the final version, I will maintain my score.

Formatting Concerns

no formatting issue is found

Author Response

Thank you for your detailed feedback and for the time you spent in reviewing our manuscript. We have addressed every point you raised and will incorporate the corresponding changes into the final version.

Weaknesses. Thank you for raising this important concern. We agree that isolating the source of performance improvements is crucial for properly attributing the effectiveness of our approach. To address this, we provide the following ablation results on the Retro*-190 dataset. The first two rows compare the performance of the original pretrained Graph2Edits model (used with Retro*) and the finetuned version (InterRetro). The substantial improvement in performance after finetuning demonstrates that the self-imitation learning process and the worst-path objective meaningfully enhance the base model’s effectiveness.

In the next two rows, we evaluate a template-based model (Template - Retro*) and its finetuned variant using the same training framework. Again, we observe a significant performance gain after fine-tuning, even though the underlying single-step model is not as strong as Graph2Edits. This indicates that our learning framework improves multi-step planning across different single-step models, not just Graph2Edits.

These results suggest that the performance gains stem from the proposed training methodology rather than solely relying on the strength of Graph2Edits.

Method | Direct Generation | Model Call 100 | Model Call 200 | Model Call 500
Graph2Edits - Retro* | 16.84 | 41.05 | 50.00 | 56.31
Graph2Edits (finetuned) - Retro* | 95.78 | 96.31 | 100.00 | 100.00
Template - Retro* | 17.89 | 38.42 | 58.42 | 79.47
Template (finetuned) - Retro* | 32.10 | 46.31 | 65.26 | 85.26

Q1. We chose Graph2Edits as our single‑step model for three practical reasons:

  • Better generalisation. Graph2Edits edits molecules at the bond‑ and atom‑level, so it can apply the same local transformation to many unseen but similar leaving groups. Template‑based methods, which store entire reaction patterns, often fail when an exact template match is missing.

  • Fast and lightweight. The Graph2Edits model requires less than 10 MB of memory, whereas a template-based method, due to the storage of template information, can consume up to 1 GB, which hinders its applicability in parallel computation. At inference time, the Graph2Edits model is far cheaper than large generative models (e.g., transformer–decoder architectures), which speeds up both training and route generation.

  • High‑quality support constraint. The Graph2Edits paper reports state-of-the-art single‑step accuracy and shows that its predicted edits follow well‑established reaction chemistry. By constraining our policy to this edit vocabulary, we preserve chemical plausibility while still allowing our advantage‑weighted updates to re‑rank the edits.

Q2. Thanks for your suggestions. We have reported the results in our response to the Weakness above. We have revised the submission to reflect the ablation results.


We would like to thank you again for raising important questions about our submission. Due to this year's rebuttal policy, we cannot upload a revised PDF to reflect these changes. However, we believe that after incorporating the valuable feedback our current version becomes much stronger. Have we sufficiently addressed the main concerns? Please feel free to let us know if there are additional concerns or questions.

Comment

Thank you for the detailed responses. The results you have reported reinforce my concern regarding the source of the performance improvement. It appears that the improvement is primarily attributable to the choice of the single-step model, rather than your proposed worst-path optimization objective. Specifically, when using the same template-based single-step model that is prevalent in prior work, your method does not seem to hold a clear advantage. Given this, and its inferior performance compared to baselines like self-improve and PDVN, how do you validate the contribution of the proposed worst-path optimization, which is presented as the main contribution of your work?

Comment

Thank you for this insightful follow-up. You have correctly pointed out that when our framework is applied to a traditional template-based model, the final performance does not surpass leading methods like PDVN. We believe this observation clarifies the true nature of our contribution: our work enables a paradigm shift by leveraging the superior generalisation of recent models, rather than just incrementally improving older, template-based approaches.

Here is how we validate the contribution of the proposed worst-path optimization:

The generalisation limit of template-based models. The core issue with template-based models is their limited ability to generalise. Template-based methods, which store entire reaction patterns, often fail when an exact template match for a novel molecule is missing. They struggle to apply transformations to unseen structures, fundamentally capping their potential performance, regardless of the search or learning algorithm applied.

Synergy with flexible, generalisable models. The true innovation of our work is demonstrated when the worst-path framework is paired with a more flexible model like Graph2Edits.

  • Graph2Edits excels at generalisation. Because it edits molecules at the bond- and atom-level, it can apply the same local transformation to many unseen but similar leaving groups.
  • Our training framework is uniquely suited to leverage this. It learns a policy on how to combine these primitive, generalisable edits to achieve a multi-step goal.
  • The evidence for this synergy is the dramatic performance lift shown in our ablation study: our framework improves the base Graph2Edits model's success rate from 56.31% to 100%. This immense gain is directly attributable to our training objective, as it is the only variable being changed in that experiment.

Therefore, the core contribution is demonstrating that our worst-path objective, when combined with a generalisable model, creates a state-of-the-art system that surpasses the limitations of the previous template-based one-step model.

We agree that our presentation can be improved to make this distinction clearer. We will revise the manuscript to better highlight this synergistic relationship and frame our contribution around this insight.

Comment

Thanks to the authors for the explanation. It is good practice to make the individual contributions of the single-step model and the worst-case objective explicit in the paper. I do not have further questions and keep my score.

Final Decision

The authors present InterRetro, a novel framework for multi-step retrosynthesis planning that reframes the problem as a worst-path optimization within tree-structured Markov Decision Processes (MDPs).

Reviewers agreed on the originality and justification of the proposed worst-path optimization. While initial concerns were raised regarding the absence of comparisons to certain state-of-the-art baselines and the precise attribution of performance improvements, the authors engaged extensively and effectively during the rebuttal period. They provided detailed explanations, additional experimental results, and prompt responses that comprehensively addressed these points. After the rebuttal, reviewers were generally positive about the responses.

Given the novel and well-supported methodology and the compelling experimental results, this submission deserves acceptance. I would encourage the authors to take the reviewers' comments into consideration, properly attribute the performance improvements to the individual components, and clarify the technical details in the revision.