Retro-R1: LLM-based Agentic Retrosynthesis
An LLM-based retrosynthesis planning agent trained end-to-end with Agentic Reinforcement Learning.
Abstract
Reviews and Discussion
The paper presents a novel method for multi-step retrosynthesis planning - RETRO-R1. Retrosynthesis planning is formulated as a sequential decision-making task, where the LLM agent takes actions based on environmental feedback. RETRO-R1 uses a plug-in single-step retrosynthesis model as a tool and builds a global graph to enable an iterative planning strategy. The authors demonstrate that the framework achieves SOTA or highly competitive performance compared to other retrosynthesis planning baselines.
Strengths and Weaknesses
Strengths:
- Instead of using LLMs in a standard QA format, RETRO-R1 formulates retrosynthesis as a sequential decision-making process involving tool-use and environment feedback. This idea is quite original.
- While prior retrosynthesis work has used reinforcement learning, training a general-purpose LLM with end-to-end RL is new in this domain.
- The authors provide experiments across multiple baselines and datasets as well as ablation studies, demonstrating strong empirical results.
- Reproducibility is addressed via GitHub links and a thorough description of implementation details.
Weaknesses:
- Figure 1 seems difficult to understand. The explanation provided in the text is somewhat chaotic and only refers to the Training and Planning parts, leaving other parts of the figure unexplained.
- There might be a typo in the Figure 3 caption. Perhaps BFS should be DFS.
- Some architectural aspects seem to come out of nowhere, e.g., the design of the reward function, the choice of the Qwen LLM as the policy network, and the choice of a template-based single-step retrosynthesis model. It would be interesting to see the motivation behind these choices as well as ablation studies covering these aspects.
- This paper https://arxiv.org/pdf/2501.08897 is not cited, although the agentic approach is very similar to what the authors of the current paper are proposing.
- Pass@1 success rate is an important metric for evaluation; however, it might not capture the model's ability to discover multiple viable routes, which is essential in retrosynthesis planning. Alternative metrics such as top-k success rate could provide a more complete picture.
Questions
- What are the underlying reasons for a specific reward design mentioned in the paper?
- Why was Qwen model chosen as the policy network across all the other available LLMs?
- Why was this exact template-based MLP chosen as the single-step model?
- Are the authors familiar with this paper? https://arxiv.org/pdf/2501.08897 What are the principal differences between this work and Retro-R1?
- Is it possible to calculate top-k success rate metrics and include them in Table 1?
Limitations
Yes
Final Justification
After reviewing the rebuttals and the paper again, I decided to raise my evaluation but lower my confidence score. On one hand, the authors sufficiently addressed my questions. Although I am not convinced by some of the design choices and arguments, I think overall the contribution is fair and the supporting evidence is sufficient. On the other hand, I agree with another reviewer that the practical implications of this application study are limited. First, the approach is very computationally expensive. Second, I find no evidence of wet-lab validation, which would be required to assess the utility of the proposed agent. Therefore, I am less sure about this evaluation.
The authors missed the rebuttal deadline and submitted all answers late as official comments. As per NeurIPS instructions, we have to ignore all submitted materials. I, personally, see no reason to ignore them, so I haven't.
Formatting Issues
None
Dear Reviewer:
Thank you for your valuable questions and feedback. We sincerely apologize for missing yesterday's rebuttal deadline. We will now submit our response as an official comment. We kindly ask for your understanding and forgiveness for this oversight.
Question1: Figure 1 seems to be difficult to understand. The explanation provided in the text is a bit chaotic and only refers to Training and Planning parts leaving other parts of the picture untouched.
Response: Thank you for your valuable suggestion. We agree that Figure 1 needs improvement and will redraw it in the revised version of the paper. Here’s a re-explanation of the figure to clarify its components and flow:
The top right of the figure illustrates how our agent interacts with the environment using two key actions: CALL and SELECT.
The CALL action invokes the single-step retrosynthesis model, which provides a list of candidate reactions for a given molecule.
The SELECT action allows the agent to choose one of these candidate reactions, transforming the current molecule into simpler precursor molecules.
The Molecule Store represents a collection of commercially available molecules. The environment automatically checks if each molecule is in this store and informs the agent of its availability through available/unavailable labels.
The left side of the figure and the red dashed box depict the training process. We design an extended PPO algorithm for multi-round tool calling. In each training iteration, a molecule (e.g., mol1) is randomly selected from the training set and fed into the model. The model then interacts with the environment through sequential search, synthesizing the molecule by alternating between molecule nodes and reaction nodes. Once a termination condition is met, the entire multi-round dialogue history forms an "episode." The trainer then calculates a rule-based reward for this episode and updates the model parameters accordingly.
The black box outlines the planning process. During planning, the model performs multiple sequential searches. At each decision point, it checks the historical records for similar entries. If a match is found, the historical sequence of molecules/reactions is re-ordered to allow the model to re-evaluate its decision. If the decision remains consistent, the process moves sequentially to the next node, performing the same action. If the decision differs, a new node is generated, expanding the search space. After the budget is exhausted, all search results form a complete synthesis graph. A Breadth-First Search (BFS) is then applied to this graph to find the shortest route.
Question2: There might be a typo in the Figure 3 caption. Perhaps BFS should be DFS.
Response: Thank you for helping us find this issue in the paper. We did not describe the role of BFS in detail, which may have caused confusion. In Figure 3, the synthesis graph is a directed graph where all edges have a weight of 1. M1 represents the initial node, and M11 is the successful termination node (an empty node). Our goal is to find the shortest route from M1 to M11, for which we indeed use Breadth-First Search (BFS).
Here's how it works:
Starting from M1, one step can reach nodes (M2, M9).
Two steps can reach nodes (M3, M6, M10).
Three steps can reach nodes (M11, M7, M4).
Therefore, the shortest synthesis route contains 3 reactions, as depicted and found via BFS. We will ensure the caption clearly reflects this.
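To make the procedure concrete, here is a minimal BFS sketch over a toy graph mirroring the example above (illustrative only; the node names and adjacency structure are ours, not taken from the paper's code):

```python
from collections import deque

def shortest_route(graph, start, goal):
    """Breadth-first search over an unweighted, directed synthesis graph.

    graph: dict mapping a node id to the list of node ids reachable in one step.
    Returns the list of nodes on one shortest path from start to goal, or None.
    """
    queue = deque([start])
    parent = {start: None}  # also serves as the visited set
    while queue:
        node = queue.popleft()
        if node == goal:  # reconstruct the path back to the start
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# Toy graph mirroring the example above: M1 reaches M11 in three steps.
graph = {
    "M1": ["M2", "M9"], "M2": ["M3", "M6"], "M9": ["M10"],
    "M3": ["M4"], "M6": ["M7"], "M10": ["M11"],
}
print(shortest_route(graph, "M1", "M11"))  # ['M1', 'M9', 'M10', 'M11']
```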
Question3: What are the underlying reasons for a specific reward design mentioned in the paper?
Response: Thank you for raising this question. The design of the reward is a crucial point in our paper. The overall goal of our reward design is to help the model successfully find shorter routes. Therefore, it is essential that successful outcomes yield a positive reward while failures result in a negative reward. Furthermore, among successful outcomes, there should be a clear distinction between longer and shorter routes.
For failures, we directly set the reward to -1.0.
For successful outcomes, note that all reference route lengths in our dataset are within 20 steps, which implies that the model can succeed within 40 interactions (two interactions correspond to one reaction step). In this case, we do not impose a penalty. During training, we set an upper limit of 60 interactions, aiming to give clear positive feedback for successes within this limit.
Specifically, when the number of interactions s is at most 40, the reward is 0.9. When s > 40, the reward is 0.9 - (s - 40) * 0.02, so at the upper limit s = 60 the reward is 0.5. This design incentivizes finding shorter routes and provides a clear positive signal for successful completions within the defined interaction budget.
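As a small illustration, the piecewise reward described above could be written as follows (a sketch based only on the numbers stated here, not the authors' actual implementation):

```python
def episode_reward(success: bool, num_interactions: int) -> float:
    """Rule-based episode reward as described above.

    Failures receive -1.0. Successes within 40 interactions (20 reaction
    steps) receive the full 0.9; beyond that, 0.02 is subtracted per extra
    interaction, reaching 0.5 at the 60-interaction training limit.
    """
    if not success:
        return -1.0
    if num_interactions <= 40:
        return 0.9
    return 0.9 - (num_interactions - 40) * 0.02

assert episode_reward(False, 10) == -1.0
assert episode_reward(True, 38) == 0.9
assert abs(episode_reward(True, 60) - 0.5) < 1e-9
```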
Question4: Why was Qwen model chosen as the policy network across all the other available LLMs?
Response: We selected the Qwen model because, according to OpenCompass [1], prior to 2025, Qwen2.5-7B-Instruct was considered the most powerful model with no more than 10B parameters, particularly excelling in reasoning capabilities. Given the demands of the multi-step retrosynthesis task on model reasoning ability and our available training resources, this represented our optimal model choice. In future work, we intend to utilize even more powerful reasoning models within the same parameter range.
[1] OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models, 2023. URL: https://github.com/open-compass/opencompass.
Question5: Why was this exact template-based MLP chosen as the single-step model?
Response: This specific template-based MLP was chosen as the single-step model primarily to ensure fair comparison with all the baselines, as they also utilize this model. This model generates reactions based on reaction templates, which guarantees a basic level of reasonableness and diversity in the reactions, making it widely accepted within the chemistry community. Furthermore, its fast inference speed facilitates efficient training and testing.
In contrast to baseline methods, the single-step model in Retro-R1 is a replaceable component. This is because our large language model, through reinforcement learning, learns chemical knowledge about molecules and reactions directly, rather than simply relying on the probability distributions predicted by the single-step model. Consequently, using a more reliable single-step model would directly enhance the overall performance of Retro-R1.
Question6: This paper https://arxiv.org/pdf/2501.08897 is not cited. What are the principal differences between this work and Retro-R1?
Thank you very much for bringing this excellent work to our attention. This paper describes a significant contribution: leveraging a large corpus of chemical literature to construct a vast reaction graph. This graph allows for the search of shortest, reliable, and experimentally validated synthesis routes, which is of great importance to chemists.
The principal difference between this work and Retro-R1 is in their primary focus. That paper is designed to plan synthesis routes for known molecules, while Retro-R1 is primarily intended for planning synthesis routes for novel molecules.
We will cite this paper in our revised manuscript and explore ways to combine its strengths with Retro-R1, aiming to improve the reliability of our planned synthetic routes while maintaining our ability to handle new molecules.
Question7: Is it possible to calculate top-k success rate metrics and include them in Table 1?
Response: Thank you for this valuable suggestion. Top-k accuracy is indeed a very important metric for many scientific problems. However, in our task there are two key challenges: first, a molecule can have a wide variety of possible synthesis routes; second, the ground truth routes in our dataset are constructed from single-step reaction data from USPTO and are not guaranteed to be the optimal routes. Therefore, it is very difficult to require that the top-k routes found by the methods in Table 1 align with the ground truth routes.
Considering that route diversity is often more important for molecular synthesis, as chemists can analyze and select from a variety of routes to gain new insights, we have focused our analysis on the diversity of the routes found by Retro-R1.
In Section 3.4 of our paper, we describe our method for constructing the synthesis graph. When the search budget is set to 500, this synthesis graph can contain up to 500 nodes. By traversing this graph, we can identify a large number of potential synthesis routes. We conducted an experiment on the Retro*-190 dataset with an iteration limit of 500. We saved the complete synthesis graph and then filtered out all synthesis routes with a length not exceeding 30. The distribution of the number of routes found for all molecules is as follows (190 molecules in total):
| Number of routes | Not Found | 0-10 | 10-100 | 100-1000 | 1000-10000 | 10000-100000 | >100000 |
|---|---|---|---|---|---|---|---|
| Number of Molecules | 26 | 11 | 26 | 38 | 54 | 23 | 12 |
To further illustrate the diversity of these routes, we extracted the sets of starting materials for each synthesis route and counted the number of unique sets. The distribution is as follows:
| Number of unique sets | Not Found | 0-10 | 10-100 | 100-1000 | 1000-10000 | >10000 |
|---|---|---|---|---|---|---|
| Number of Molecules | 26 | 14 | 25 | 49 | 50 | 26 |
As these distributions show, our Retro-R1 model can indeed find a diverse array of routes involving a large number of different starting materials. This provides chemists with a sufficiently large space for screening and experimentation, which we believe is a significant advantage in real-world applications.
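For readers who wish to reproduce this kind of analysis, below is a rough sketch of counting bounded-length routes in a saved synthesis graph. The function and variable names are ours; we assume the graph is stored as an adjacency map from a state node to its successor states, with a single terminal node representing the state with no unsolved molecules:

```python
from functools import lru_cache

def count_routes(graph, start, goal, max_len=30):
    """Count distinct start -> goal paths with at most max_len edges.

    graph: dict mapping a node id to a list (or tuple) of successor node ids.
    The goal node is assumed to have no outgoing edges (synthesis finished).
    If the graph contains cycles, walks that revisit states within the length
    budget are also counted, so the result is an upper bound in that case.
    """
    @lru_cache(maxsize=None)
    def count(node, budget):
        if node == goal:
            return 1
        if budget == 0:
            return 0
        return sum(count(nxt, budget - 1) for nxt in graph.get(node, ()))

    return count(start, max_len)

# Toy usage: two routes of length <= 3 from "s0" to the terminal "done".
toy = {"s0": ["s1", "s2"], "s1": ["done"], "s2": ["done"]}
print(count_routes(toy, "s0", "done"))  # 2
```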
Thank you to the authors for providing answers to the other questions.
While I do not have further questions, I would like to make the point regarding the choice of the Qwen model. I strongly believe that presenting a new method (an LLM agent) using a single model with a justification that a single study (i.e. OpenCompass) positioned it among the leading reasoning models is an example of a poor study design.
Thank you for your comment. We understand your concern regarding the choice of a single base model to validate our new method.
We would like to clarify that our choice aligns with a common practice in the field of training reasoning models, where researchers often select a single, state-of-the-art base model to demonstrate a method's effectiveness. During the period of our research, Qwen2.5 was one of the most powerful and widely used models for such tasks.
For instance, contemporary studies have also utilized this model: [1] trained Qwen2.5-32B using DAPO; [2] fine-tuned Qwen2.5-32B-Instruct with SFT while equipping it with budget forcing; and [3] trained Qwen2.5-0.5B-Instruct using StarPO.
Furthermore, the strong performance of Qwen2.5 is not only supported by OpenCompass but also corroborated by its high rankings on multiple other prominent leaderboards, such as LiveBench [4], llm-stats [5], and the LMSys Chatbot Arena [6].
Moving forward, we are committed to continually adopting the most advanced base models available to tackle scientific problems. We hope this response adequately addresses your concerns.
[1] Yu, Qiying, et al. "Dapo: An open-source llm reinforcement learning system at scale." arXiv preprint arXiv:2503.14476 (2025).
[2] Muennighoff, Niklas, et al. "s1: Simple test-time scaling." arXiv preprint arXiv:2501.19393 (2025).
[3] Wang, Zihan, et al. "Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning." arXiv preprint arXiv:2504.20073 (2025).
[4] White, Colin, et al. "LiveBench: A challenging, contamination-limited LLM benchmark." arXiv preprint arXiv:2406.19314 (2024).
[5] LLM Stats. "LLM Stats – Leaderboard and Metrics for Large Language Models." Project homepage: https://llm-stats.com.
[6] Chiang, Wei-Lin, et al. "Chatbot arena: An open platform for evaluating llms by human preference." Forty-first International Conference on Machine Learning. 2024.
Thank you to the authors for additional clarifications.
Thank you for taking the time to review our rebuttal. We sincerely hope that our clarifications have effectively addressed your concerns, and we would greatly appreciate your positive support in the final evaluation.
Thank you to the authors for clarifications on questions 1 and 2. The answer to question 3 is worrying. The design of the reward function, which is crucial for this work, appears as a set of empirical heuristics without proper justifications.
Thank you for your valuable comment.
Regarding the design of our reward function, we initially considered incorporating domain-specific chemical knowledge, for example, by using a model to evaluate reaction difficulty. However, no reliable evaluation functions for this purpose are currently available. Furthermore, introducing such complex, domain-specific constraints would have rendered our model's performance incomparable to existing baselines.
To design an effective reward function, we therefore adopted an empirically-driven approach, drawing inspiration from a wide range of literature and conducting extensive experiments. To provide context and justification for our approach, we highlight that heuristic reward design is a common and effective practice in related work. Here are a few examples:
[1] Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." arXiv preprint arXiv:2501.12948 (2025).
This paper adopts a rule-based reward system that mainly consists of two types of rewards: accuracy rewards and format rewards.
[2] Yu, Qiying, et al. "Dapo: An open-source llm reinforcement learning system at scale." arXiv preprint arXiv:2503.14476 (2025).
This paper utilizes a "Soft Overlong Punishment" to penalize excessively long outputs. The penalty is calculated from the length of the generated sequence $|y|$, a maximum allowed length $L_{\max}$, and a defined interval $L_{\text{cache}}$ over which the reward linearly decreases:
$$
R_{\text{length}}(y) = \begin{cases} 0, & |y| \le L_{\max} - L_{\text{cache}} \\ \frac{(L_{\max} - L_{\text{cache}}) - |y|}{L_{\text{cache}}}, & L_{\max} - L_{\text{cache}} < |y| \le L_{\max} \\ -1, & L_{\max} < |y| \end{cases}
$$
[3] Le, Hung, et al. "Coderl: Mastering code generation through pretrained models and deep reinforcement learning." Advances in Neural Information Processing Systems 35 (2022): 21314-21328.
This paper employs a sophisticated reward function for code generation. A core part of its reward assigns values to the generated code $W^s$:
$$
r(W^s) = \begin{cases} -1.0, & \text{if } W^s \text{ cannot be compiled (i.e., compile error)} \\ -0.6, & \text{if } W^s \text{ cannot be executed with unit tests (i.e., runtime error)} \\ -0.3, & \text{if } W^s \text{ failed any unit test} \\ +1.0, & \text{if } W^s \text{ passed all unit tests} \end{cases}
$$
[4] Gehring, Jonas, et al. "Rlef: Grounding code llms in execution feedback with reinforcement learning." arXiv preprint arXiv:2410.02089 (2024).
In this paper, the LLM can execute generated code and obtain execution feedback.
The reward functions in all these prominent works are, in essence, carefully designed heuristics. Inspired by these approaches and through extensive experimentation, we developed the reward function presented in our paper, which we have empirically shown to be effective for the multi-step retrosynthesis task. We hope this explanation adequately addresses your concerns.
This paper introduces RETRO-R1, a novel framework that models multi-step retrosynthesis as a sequential decision-making process, where an LLM agent interacts with a chemical reaction environment. The LLM agent is trained using reinforcement learning to iteratively select unsolved molecules for decomposition and identify appropriate reactions using external tools. Leveraging the LLM's inherent reasoning and tool-use capabilities, the method demonstrates strong performance on established retrosynthesis benchmarks.
Strengths and Weaknesses
Strengths:
- The integration of LLM agents to emulate a chemist's iterative discovery process in retrosynthesis is a compelling aspect of this work, enabling a data-driven approach to navigate the complex reaction space.
- The approach leverages the reasoning capabilities of state-of-the-art LLMs, combined with reinforcement learning (specifically PPO and GAE as described in the paper) for learning sequential decision-making with verifiable rewards.
Weaknesses:
- The overall clarity of the paper could be significantly improved. Several core concepts and their formalizations are either missing or insufficiently explained. For instance, a precise definition of the state representation for RL training, the mechanism of state transitions between molecule and reaction nodes, and the distinction/usage of "environment tokens" and "agent tokens" in the PPO updates (Section 3.3) are unclear.
- The paper lacks detail on the specific prompt templates used for the LLM agent, which are crucial for reproducibility and understanding the agent's interaction logic.
- While "R1" in the name suggests a focus on reasoning, the paper does not adequately demonstrate or analyze the LLM agent's emergent reasoning capabilities. How does the iterative planning strategy or RL training specifically shape the agent's "thinking"? A more detailed analysis or qualitative examples (e.g., thought process demos for challenging cases) would be beneficial.
- The reported computational cost (70 hours for training on 32x A800 80G GPUs) is substantial.
Questions
- Given the emphasis implied by 'R1' in the name, what is the specific role of reasoning in RETRO-R1's performance? To what degree does it quantitatively affect the overall success rate or route quality? Can you provide any insights into how the "length" or "pattern" of the agent's reasoning (e.g., the sequence of thoughts or actions) changes over the course of training?
- Your multi-round planning strategy appears to utilize the full model call budget (e.g., up to 500 single-step model calls) before extracting the shortest route. In contrast, many baselines typically terminate once a feasible route is found. Could you clarify whether, for a fair comparison, all baselines were also allowed to exhaust their full iteration/model call budget during evaluation.
- What are the prompt templates used for the LLM agent's interactions with the environment and tools?
Limitations
yes
Final Justification
The rebuttal process confirmed my concerns regarding the manuscript's evaluation. The practical value of the proposed method remains questionable, as the marginal performance improvement does not appear to justify the significant computational cost. Since the authors promptly responded to the comments, I would neither decrease nor increase the score.
Formatting Issues
no formatting issue is found
Question1: The overall clarity of the paper could be significantly improved.
Response: We appreciate you highlighting the need for improved clarity. We agree and will significantly enhance these sections in our revised manuscript.
Agent Tokens vs. Environment Tokens: Agent tokens are those generated by the LLM. Environment tokens are those generated by the environment to communicate node states to the LLM.
Reinforcement Learning State Representation: The state representation is a long sequence formed by the alternating appearance of agent tokens and environment tokens.
Alternating Molecular and Reaction Nodes: At a molecular node, the agent uses the CALL action to interact with a molecule, transitioning to a reaction node (representing potential reactions for that molecule). At a reaction node, the agent uses the SELECT action to choose a specific reaction, which then leads to a new molecular node (representing the precursors from that chosen reaction).
We believe these additions will provide a much clearer and more comprehensive understanding of our methodology.
Question2: What are the prompt templates used for the LLM agent's interactions with the environment and tools?
Response: We've provided instances of all the prompt templates used for the LLM agent's interactions with the environment, specifically demonstrating the synthesis of the complex molecule C[C@H](c1ccccc1)N1C[C@]2(C(=O)OC(C)(C)C)C=CC[C@@H]2C1=S. These prompts form a multi-turn conversational format during the generation process.
The order is: System prompt -- User prompt -- [ LLM response1 -- Environment response1 -- LLM response2 -- Environment response2 ] x N
The notation x N indicates that the content within the [] brackets will be repeated until a termination condition is met.
System prompt
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"name": "single_step_retro", "description": "Perform single step retrosynthesis for a molecule and return several possible reactions to synthesize it. Note that the reactions may be incorrect and each reactant is marked as 'available' or 'unavailable'. The unavailable molecules have to be synthesized further.", "parameters": {"type": "object", "properties": {"molecule": {"type": "string", "description": "The ID for the molecule to be synthesized. For example, 0-0 ."}}, "required": ["molecule"]}}
{"name": "select_reaction", "description": "Given several reactions to synthesize a molecule, use this tool to select one from them.", "parameters": {"type": "object", "properties": {"reaction": {"type": "string", "description": "The ID for the selected reaction. For example, 0-0-0 ."}}, "required": ["reaction"]}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
User prompt
You are a professional organic chemist, skilled at finding pathways to synthesize novel molecules. Given a target molecule, your task is to find a pathway where the target can be eventually synthesized using ingredient molecules, and each step of the pathway is a viable chemical reaction. The maximum number of reaction steps you can use is 30.
You can use the tools provided to you to find the pathway. You can use the tool as many times as you want.
You must first conduct reasoning inside <think>...</think>. Then use the tool call <tool_call>...</tool_call> to call the tool.
Output format for tool call:
<think>
...
</think>
<tool_call>
...
</tool_call>
The target molecule is C[C@H](c1ccccc1)N1C[C@]2(C(=O)OC(C)(C)C)C=CC[C@@H]2C1=S. Now start to search for the pathway.
Molecule state 0: Unsolved molecules:
Molecule 0-0: C[C@H](c1ccccc1)N1C[C@]2(C(=O)OC(C)(C)C)C=CC[C@@H]2C1=S
Now, 0 steps are used.
Please select the molecule with the highest synthesis difficulty from the unsolved molecules. Use the tool 'single_step_retro' to synthesize it.
LLM response1:
<think>
The molecule with the highest synthesis difficulty is 0-0 as it is the only one in the unsolved list.
</think>
<tool_call>
{"name": "single_step_retro", "arguments": {"molecule": "0-0"}}
</tool_call>
Environment response1: (Due to the character limit, please see the specific reactions and reasoning under reviewer aDxU. We are really sorry for that.)
<tool_response>
**please see under reviewer aDxU.**
Please determine which of these reactions are possible. Then use the tool 'select_reaction' to choose the best reaction whose reactants are available or not available but easy to synthesize.
</tool_response>
LLM response2:
<think>
**please see under reviewer aDxU.**
</think>
<tool_call>
{"name": "select_reaction", "arguments": {"reaction": "0-0-0"}}
</tool_call>
Environment response2:
<tool_response>
Successfully select reaction 0-0-0 to synthesize molecule C[C@H](c1ccccc1)N1C[C@]2(C(=O)OC(C)(C)C)C=CC[C@@H]2C1=S . The state goes to:
Molecule state 1: Unsolved molecules:
Molecule 1-0: C=CC[C@@H]1C(=S)N([C@H](C)c2ccccc2)C[C@@]1(C=C)C(=O)OC(C)(C)C
To reach this state, at least 1 reaction steps are used.
Please select the molecule with the highest synthesis difficulty from the unsolved molecules. Use the tool 'single_step_retro' to synthesize it.
</tool_response>
Question3: How does the iterative planning strategy or RL training specifically shape the agent's "thinking"? Can you provide any insights into how the "length" or "pattern" of the agent's reasoning changes over the course of training?
Response: In our response to Question 2, we provided an example of the LLM interacting with the environment, and we'll elaborate on that here to demonstrate its qualitative reasoning.
To synthesize the complex molecule C[C@H](c1ccccc1)N1C[C@]2(C(=O)OC(C)(C)C)C=CC[C@@H]2C1=S, our LLM agent first calls the single-step retrosynthesis model. After receiving six candidate reactions from the environment, the LLM systematically analyzes and compares each reaction.
This process reveals distinct "thinking" behaviors shaped by the RL training. The LLM receives six candidate reactions from the environment and analyzes and compares them one by one, revealing interpretable reasoning:
Molecular Complexity: For instance, it identifies that the precursor in "reaction 0-0-0" is slightly simpler than the target molecule, while the precursor in "0-0-4" is not simpler. When visualized, "reaction 0-0-0" shows the precursor having an opened ring structure compared to the target, validating the LLM's judgment.
Reaction Plausibility/Optimality: For example, it deems the reaction in "0-0-1" as "not optimal." This suggests an ability to evaluate the quality or efficiency of a proposed reaction.
These examples clearly indicate that the LLM indeed comprehends molecules and chemical reactions. This stands in stark contrast to baseline methods, where neural networks directly predict a numerical value for synthetic difficulty. Such values are often heavily dependent on the training dataset and lack the inherent chemical understanding and generalization capability that our LLM exhibits. During training, the reasoning length of the agent gradually shortened, and the agent learned to express key information more concisely, consistent with the results of the concurrent work [1].
[1] Wang, Zihan, et al. "Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning." arXiv preprint arXiv:2504.20073 (2025).
Question4: The reported computational cost (70 hours for training on 32x A800 80G GPUs) is substantial.
Response: Thank you for pointing out this problem. We acknowledge that training LLMs with RL is resource-intensive. However, our core objective is to develop a chemist agent powered by an LLM, capable of chemist-like reasoning and solving complex chemical problems. These LLMs can explicitly think and deliberate on each decision, which significantly boosts both interpretability and reliability.
RL is crucial for achieving this goal, as it dramatically enhances the LLM's reasoning abilities. It's also remarkably data-efficient; we achieved competitive performance using only 11,366 data items, compared to 299,202 used by baselines. Furthermore, LLMs possess strong generalization abilities, allowing their capabilities to be easily transferred to out-of-domain datasets, thereby enabling broader applicability.
Question5: Many baselines typically terminate once a feasible route is found. Could you clarify whether all baselines were also allowed to exhaust their full iteration/model call budget during evaluation.
Response: Thank you for this suggestion. All baselines are deterministic search algorithms that return the "best" route under their scoring rules and then stop. This means they don't fully use the allotted computational budget. To address this, we modified PDVN, the best-performing baseline. We changed its code so it continues to expand nodes after finding a route, using the full budget. We then found the shortest route within the resulting AND-OR tree, rather than relying on its internal scoring.
On Retro*-190, PDVN finds 90 shorter routes while Retro-R1 finds 72.3. On ChEMBL-1000, the results are as follows:
| | 5 | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|
| PDVN | 66.6 | 77.7 | 80.1 | 80.6 | 80.6 | 80.6 |
| Retro-R1 | 65.4 | 77.9 | 81.2 | 82.5 | 83.2 | 83.7 |
PDVN's performance did improve, especially for finding more and shorter routes on in-domain datasets. However, on the out-of-domain ChEMBL-1000 dataset, PDVN still lags significantly behind Retro-R1, particularly for complex molecules requiring longer synthetic routes. This is because methods like PDVN are trained on data with route lengths not exceeding 15. This limits their ability to generalize to longer routes, causing them to search broadly rather than deeply.
Thanks for the detailed responses. After reading the responses, I am prone to keep my score given the following concerns:
- Fairness in the evaluation: As you have acknowledged, all baselines terminate once a feasible route is found, preventing them from searching for better routes. To fix this issue, all baseline methods need to be rerun. Furthermore, the 'shorter route' metric is not very informative, as it may be constrained by the success rate. A better practice is to count shorter routes only for those molecules that can be solved by all the baselines.
- Limited discussion of thinking behavior: The evolved thinking pattern is a crucial feature of the proposed method, differentiating from traditional methods, and deserves a thorough discussion in the manuscript.
- Huge computational cost: The proposed method consumes many times more computational cost than baselines while achieving marginal improvement. The authors are encouraged to demonstrate the value of the method by studying tasks for which traditional methods fall short, instead of proposing an alternative method where traditional methods are sufficient.
Question2: Limited discussion of thinking behavior: The evolved thinking pattern is a crucial feature of the proposed method, differentiating from traditional methods, and deserves a thorough discussion in the manuscript.
In our previous response, we provided case studies of the model's thinking process, which demonstrate its understanding of molecules and reactions. We will add these examples to the appendix of the revised manuscript.
<think>
Let me analyze each reaction option systematically:
1. Reaction 0-0-0: Direct precursor is unavailable and appears nearly as complex as the target. Not ideal.
2. Reaction 0-0-1: Uses an available phosphorus reagent but requires an unavailable oxime precursor. The reaction between oxime and phosphorus ylide is plausible but not optimal.
3. Reaction 0-0-2: Similar to 0-0-1 but with a different phosphorus reagent. Still requires the same unavailable oxime. No significant advantage.
4. Reaction 0-0-3: Forms the target via oxidation of a thiirane, but the precursor is unavailable and equally complex.
5. Reaction 0-0-4: Represents an elimination product, but precursor is unavailable and not simpler.
6. Reaction 0-0-5: Carbamate formation, but requires an unavailable carboxylic acid precursor.
Optimal Choice: Reaction 0-0-0 is the most reasonable despite an unavailable precursor because:
- It represents a direct transformation (likely a cycloaddition or migration)
- The unavailable precursor is only slightly simpler than target
- Other options either require multiple steps or have less reliable transformations
- The phosphorus reagents in other options are less practical
</think>
We would also like to clarify why the thinking process was not the central focus of our paper. Our work's primary contribution lies in the agent's design and training methodology, rather than a micro-analysis of its emergent reasoning paths. This approach is analogous to prior work in mathematical reasoning, such as [1,2], which also emphasizes the model's overall performance and training framework over a detailed analysis of the thought process itself.
We acknowledge that the model's current reasoning does not yet match the sophistication of an expert chemist, who can perform high-level, multi-step strategic planning. Achieving that level of expert-like thinking would likely require a significant investment in creating a large-scale dataset of expert-annotated thought chains for fine-tuning, similar to the efforts behind major achievements like OpenAI's IMO gold medal-winning model. Our work takes a critical first step by demonstrating that a large language model, trained with reinforcement learning, can surpass specialized small models on this highly complex task. We believe that future work incorporating expert-annotated data will lead to even more significant improvements.
[1] Shao, Zhihong, et al. "Deepseekmath: Pushing the limits of mathematical reasoning in open language models." arXiv preprint arXiv:2402.03300 (2024).
[2] Zhang, Di, et al. "Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning." arXiv preprint arXiv:2410.02884 (2024).
Question3: Huge computational cost: The proposed method consumes many times more computational cost than baselines while achieving marginal improvement. The authors are encouraged to demonstrate the value of the method by studying tasks for which traditional methods fall short, instead of proposing an alternative method where traditional methods are sufficient.
We acknowledge that our large model-based approach has a higher computational cost than the specialized smaller models. However, it is important to consider the context of this challenging field. Because smaller models lack a true understanding of chemical principles, traditional methods have seen limited progress for many years. Planning molecular synthesis remains a major bottleneck in drug discovery, and the total synthesis of natural products is often considered an almost unreachable goal.
We view the current benchmark datasets as analogous to GSM8k (Grade School Math 8K) in the domain of mathematics. Surpassing traditional methods on these foundational benchmarks is a crucial and non-trivial first step. As we collect and annotate more high-quality data in the future, we are confident that the advantages of large language models in tackling these complex reasoning tasks will become even more pronounced and their value will be fully demonstrated.
Thank you again for your valuable suggestions. We hope that our responses have fully addressed your concerns.
Thanks to the authors for the detailed response and additional results. However, some of them reinforce my concerns about the fairness of the evaluation in the manuscript: (1) the evaluation of baselines is biased in that they do not use up the computation budget as Retro-R1 does, and (2) the original quality metrics overlap with the success rate, making the results hard to interpret. Given these issues, the conclusions drawn from the experiments lack sufficient objectivity. Furthermore, the proposed method is computationally intensive, especially for a problem where traditional methods already perform quite well. This calls into question the practical significance of the paper's contribution to the retrosynthesis field. In light of these persistent concerns, I will maintain my original score.
Thank you for your reply. We would like to address your concerns regarding the experimental fairness and computational cost.
Regarding the experimental setup in our original paper, our primary conclusion—the improvement in success rate by Retro-R1—remains valid. For any failed molecule, all baseline methods would have necessarily exhausted their entire budget, making the comparison on this key metric fair.
Furthermore, concerning route length, as detailed in our previous response, we have already conducted a completely fair comparison. We modified the code for all baselines to ensure full budget utilization and compared route lengths only on the subset of molecules that all methods successfully solved. These results reaffirmed the advantage of Retro-R1, and we will incorporate these findings into the revised manuscript. Additionally, Figure 4 in our original paper highlights another key advantage of our method: its ability to find very long, viable synthesis routes for difficult molecules, a task where prior methods typically fail.
Regarding the computational cost, we believe the increased resource investment is justified considering the significant challenge of the retrosynthesis task and the immense potential of large language model-based approaches. While our current work presents comparisons on established datasets, we anticipate that the advantages of Retro-R1 will become even more pronounced when applied to more challenging and out-of-distribution datasets in the future.
Thank you once again for your suggestions, which will certainly help improve our paper. We hope this response fully addresses your remaining concerns, and we would be grateful if you would consider reassessing our work in light of these clarifications.
Thank you for your valuable comments. We are pleased that our previous response addressed most of your concerns. For the remaining points, our responses are as follows:
Question1: Fairness in the evaluation: As you have acknowledged, all baselines terminate once a feasible route is found, preventing them from searching for better routes. To fix this issue, all baseline methods need to be rerun. Furthermore, the 'shorter route' metric is not so informative, as it may be constrained by the success rate. A better practice is to count shorter routes only for those molecules that can be solved by all the baselines.
Thank you for this excellent suggestion for ensuring a fairer comparison. As you recommended, we have rerun the experiments with a revised evaluation protocol. All baseline methods were modified to fully utilize their 500-iteration budget before termination. Furthermore, to make the 'shorter route' metric more informative, we now only compare route lengths for molecules that were successfully solved by all baseline methods. The results are as follows:
On the Retro*-190 test set, 122 molecules were solved by all methods. The number of routes found by each method that were shorter than the ground truth is shown below:
| | Retro*-0 | Retro* | Retro*+-0 | Retro*+ | EG-MCTS | PDVN | Retro-R1 |
|---|---|---|---|---|---|---|---|
| Shorter routes | 54 | 59 | 59 | 61 | 55 | 68 | 66 |
On the ChEMBL-1000 test set, 716 molecules were solved by all methods. The success rates with respect to the route length limit are as follows:
| Route length limit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Retro*-0 | 10.75 | 36.03 | 60.06 | 77.23 | 86.73 | 92.32 | 95.25 | 97.77 | 98.46 | 99.30 |
| Retro* | 10.75 | 35.75 | 60.20 | 78.21 | 88.13 | 92.46 | 96.37 | 97.91 | 98.46 | 99.30 |
| Retro*+ -0 | 10.75 | 35.61 | 60.06 | 77.65 | 87.15 | 93.02 | 95.53 | 98.04 | 98.74 | 99.44 |
| Retro*+ | 10.75 | 35.34 | 60.34 | 77.79 | 87.43 | 93.02 | 96.09 | 97.91 | 98.74 | 99.58 |
| EG-MCTS | 10.75 | 35.75 | 57.82 | 78.07 | 86.45 | 91.62 | 95.39 | 97.21 | 98.88 | 99.72 |
| PDVN | 10.75 | 35.89 | 62.99 | 81.98 | 90.36 | 94.97 | 97.21 | 98.32 | 99.44 | 99.58 |
| Retro-R1 | 10.75 | 35.89 | 62.99 | 82.82 | 91.48 | 95.53 | 97.91 | 98.60 | 99.44 | 99.58 |
Although this stricter comparison filters out most of the challenging molecules where Retro-R1's advantages are most pronounced, our method still demonstrates superior performance across most metrics compared to all baselines.
The authors use reinforcement learning to fine-tune an LLM agent to perform multi-step retrosynthesis planning.
Strengths and Weaknesses
Strengths
- Novel application of RL+LLM for the retrosynthesis task
- The authors demonstrate strong empirical results with the LLM-based agent
- The authors for the most part present a clear and fair evaluation compared to baseline methods, utilizing the same single-step retrosynthesis model
Weaknesses
- It's not clear how necessary using an LLM is when the same single-step model is used--the results are not obviously better than PDVN, an RL-based method that doesn't use LLMs
- Additional evaluations beyond pass@1 and success rate at N are not presented. The number of routes found under a budget, for instance, is a very useful (arguably even more useful) evaluation in retrosynthesis.
- The comparisons are made by iterations of the single-step model, but the computational overhead of using the LLM is not presented. This is important for real-world deployment of CASP systems.
Questions
In addition to addressing the above concerns:
- Can the authors account for the discrepancy between their reported results on PDVN and the results reported in the original PDVN paper? The original paper has higher values for the iterations shown.
- Can the authors perform additional evaluation / case studies / demonstrations for the benefit of using LLM in this use case? Are there any interpretable reasoning behaviors that are learned from the training? Is the model better at remembering how to break down a molecule that it has seen before (and perhaps reaching such molecules)?
- Can the authors compare to additional recent multi-step retrosynthesis methods? E.g. [1-3]
[1] https://www.nature.com/articles/s41467-024-55374-9 [2] RetroGraph https://arxiv.org/abs/2206.11477 [3] MEEA* https://www.nature.com/articles/s42004-024-01133-2
Limitations
yes
Final Justification
I do not believe that this work provides a very significant advancement in computer-aided retrosynthesis for domain scientists. However, the empirical results still hold merit, and the methodology is novel and one that seems important to try. As a result, I will retain my score of 4.
Formatting Issues
none
Question1: It's not clear how necessary using an LLM is when the same single-step model is used.
Response: We appreciate this important question. Our core objective is to build a chemist agent capable of exhibiting chemist-like reasoning and solving complex chemical problems. Such an LLM can explicitly reason and deliberate on each decision, significantly enhancing both interpretability and reliability.
RL is crucial for improving the LLM's reasoning capabilities. It is highly data-efficient; we achieved competitive performance using only 11,366 data items, compared to 299,202 used by baselines. Furthermore, LLMs possess strong generalization abilities, allowing their capabilities to be easily transferred to out-of-domain datasets, thereby enabling broader applicability.
In this paper, we tackled the multi-step retrosynthesis task to demonstrate that an LLM, when trained with RL, can surpass specially designed small models and algorithms. Using the same single-step model was primarily for fair comparison with the baselines. However, unlike the baselines, the single-step model is a replaceable component in our system. Recent works such as NatureLM and Chemma have also shown that LLMs themselves can function as powerful single-step models.
Question2: Additional evaluations beyond pass@1 and success rate at N are not presented. The number of routes is a very useful evaluation.
Response: Thank you for this excellent point. We focused on pass@1 and success rate at N metrics primarily to align with previous work and facilitate fair comparisons. All baselines, such as Retro*, EG-MCTS, and PDVN, are designed to find only a single route, and as deterministic search algorithms, repeated runs yield identical routes, thus precluding metrics related to the number of routes found.
However, we agree that the ability to find multiple diverse routes under a given budget is highly valuable in retrosynthesis. In Section 3.4 of our paper, we describe our method for constructing the synthesis graph. When the search budget is set to 500, this synthesis graph can contain up to 500 nodes. By traversing this graph, we can identify a large number of potential synthesis routes for chemists to choose from.
To demonstrate this capability, we conducted an experiment on the Retro*-190 dataset with an iteration limit of 500. We saved the complete synthesis graph and then filtered out all synthesis routes with a length not exceeding 30. The distribution of the number of routes found for all molecules is as follows (190 molecules in total):
| Number of routes | Not Found | 0-10 | 10-100 | 100-1000 | 1000-10000 | 10000-100000 | >100000 |
|---|---|---|---|---|---|---|---|
| Number of Molecules | 26 | 11 | 26 | 38 | 54 | 23 | 12 |
To further illustrate the diversity of these routes, we extracted the sets of starting materials for each synthesis route and counted the number of unique sets. The distribution is as follows:
| Number of unique sets | Not Found | 0-10 | 10-100 | 100-1000 | 1000-10000 | >10000 |
|---|---|---|---|---|---|---|
| Number of Molecules | 26 | 14 | 25 | 49 | 50 | 26 |
As these distributions show, our Retro-R1 model can indeed find a diverse array of routes involving a large number of different starting materials. This provides chemists with a sufficiently large space for screening and experimentation, which we believe is a significant advantage in real-world applications.
Question3: The computational overhead of using the LLM is not presented.
Response: We appreciate you raising this important point. We acknowledge that the inference time for LLM-based methods is currently slower than that of smaller, specialized models. However, the core advantage of using a large model lies in its ability to provide an explicit reasoning process, leading to interpretable and reliable results that can directly assist chemists in decision-making.
In the real-world context of molecular synthesis, validating a synthetic route through wet-lab experiments often takes days, or even weeks. Compared to this significant experimental time, the computational overhead of LLM inference is relatively very small. Therefore, focusing on increasing the success rate of the planning process, even with a slightly longer computational time, offers immense value by significantly reducing costly and time-consuming experimental failures.
Question4: Can the authors account for the discrepancy between their reported results on PDVN and the results reported in the original PDVN paper?
Response: In Section 4.1 "Single-step Model", we describe four versions of single-step model weights (V1-V4). V1 was trained on real chemical reactions, while V2-V4 were trained on reactions found through a search process, which don't have experimental validation to guarantee their real-world feasibility.
We observed that some high-probability chemical reactions proposed by V4 have extremely low probabilities in V1. This can lead to routes that are not viable. To circumvent this issue and ensure a fair comparison across all methods, we chose to use V1 to propose k=10 candidate reactions. (In the original PDVN paper, k=50, which essentially included almost all possible reactions and provided no meaningful constraint.) Then we used V2-V4 to re-calculate and normalize the probabilities for these 10 reactions.
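To make the last step concrete, here is a minimal sketch of the re-normalization described above (the function name and array layout are ours, not from the released PDVN or Retro-R1 code):

```python
import numpy as np

def renormalize_over_v1_candidates(probs_v1, probs_vx, k=10):
    """Keep the top-k reactions proposed by V1, then re-score them with Vx.

    probs_v1, probs_vx: 1-D arrays of probabilities that two single-step
    model versions assign to the same full list of candidate reactions.
    Returns the indices of V1's top-k candidates together with Vx's
    probabilities renormalized over that k-candidate subset.
    """
    top_k = np.argsort(probs_v1)[::-1][:k]   # candidates proposed by V1
    subset = probs_vx[top_k]
    return top_k, subset / subset.sum()      # renormalize Vx over the subset
```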
Question5: Can the authors perform additional demonstrations for the benefit of using LLM in this use case?
Response: We appreciate the opportunity to further elaborate on the benefits of using an LLM in this context and to demonstrate its interpretable reasoning behaviors. As discussed in our response to Question 1, our primary goal in using a large model is to build a chemist agent that exhibits chemist-like reasoning and can tackle complex chemical problems. To illustrate these capabilities, we show an example of the LLM thinking.
The LLM receives six candidate reactions from the environment and then analyzes and compares these reactions one by one. This analysis reveals interpretable reasoning:
Molecular Complexity: For instance, it identifies that the precursor in "reaction 0-0-0" is slightly simpler than the target molecule, while the precursor in "0-0-4" is not simpler. When visualized, "reaction 0-0-0" shows the precursor having an opened ring structure compared to the target, validating the LLM's judgment.
Reaction Plausibility/Optimality: For example, it deems the reaction in "0-0-1" as "not optimal." This suggests an ability to evaluate the quality or efficiency of a proposed reaction.
These examples clearly indicate that the LLM indeed comprehends molecules and chemical reactions. This stands in stark contrast to baseline methods, where neural networks directly predict a numerical value for synthetic difficulty. Such values are often heavily dependent on the training dataset and lack the inherent chemical understanding and generalization capability that our LLM exhibits.
Environment:
<tool_response>
Reaction state 0-0: Possible reactions to synthesize molecule 0-0: C[C@H](c1ccccc1)N1C[C@]2(C(=O)OC(C)(C)C)C=CC[C@@H]2C1=S are as follows:
Reaction 0-0-0: C=CC[C@@H]1C(=S)N([C@H](C)c2ccccc2)C[C@@]1(C=C)C(=O)OC(C)(C)C (unavailable)
Reaction 0-0-1: COc1ccc(P2(=S)SP(=S)(c3ccc(OC)cc3)S2)cc1 (available) + C[C@H](c1ccccc1)N1C[C@]2(C(=O)OC(C)(C)C)C=CC[C@@H]2C1=O (unavailable)
Reaction 0-0-2: C[C@H](c1ccccc1)N1C[C@]2(C(=O)OC(C)(C)C)C=CC[C@@H]2C1=O (unavailable) + S=P12SP3(=S)SP(=S)(S1)SP(=S)(S2)S3 (available)
Reaction 0-0-3: C[C@H](c1ccccc1)N1C(=O)[C@]2(C(=O)OC(C)(C)C)C=CC[C@@H]2C1=S (unavailable)
Reaction 0-0-4: C[C@H](c1ccccc1)N1C[C@]2(C(=O)OC(C)(C)C)C#CC[C@@H]2C1=S (unavailable)
Reaction 0-0-5: CN(C)C(OC(C)(C)C)OC(C)(C)C (available) + C[C@H](c1ccccc1)N1C[C@]2(C(=O)O)C=CC[C@@H]2C1=S (unavailable)
</tool_response>
LLM:
<think>
Let me analyze each reaction option systematically:
1. Reaction 0-0-0: Direct precursor is unavailable and appears nearly as complex as the target. Not ideal.
2. Reaction 0-0-1: Uses an available phosphorus reagent but requires an unavailable oxime precursor. The reaction between oxime and phosphorus ylide is plausible but not optimal.
3. Reaction 0-0-2: Similar to 0-0-1 but with a different phosphorus reagent. Still requires the same unavailable oxime. No significant advantage.
4. Reaction 0-0-3: Forms the target via oxidation of a thiirane, but the precursor is unavailable and equally complex.
5. Reaction 0-0-4: Represents an elimination product, but precursor is unavailable and not simpler.
6. Reaction 0-0-5: Carbamate formation, but requires an unavailable carboxylic acid precursor.
Optimal Choice: Reaction 0-0-0 is the most reasonable despite an unavailable precursor because:
- It represents a direct transformation (likely a cycloaddition or migration)
- The unavailable precursor is only slightly simpler than target
- Other options either require multiple steps or have less reliable transformations
- The phosphorus reagents in other options are less practical
</think>
Question6: Can the authors compare to additional recent multi-step retrosynthesis methods?
Response: We appreciate your suggestion. We did attempt to compare with DreamRetroer[1]. However, we believe the truly relevant code for this work is not open-source. We examined their codebase but found it to be very similar to EG-MCTS, lacking the specific implementations corresponding to the "Abstraction" and "Dreaming" described in their paper. We tried to contact the authors via email but got no response.
Regarding RetroGraph[2], its code is not publicly available, preventing a direct comparison. The PDVN paper did include a comparison with RetroGraph, likely due to both works originating from the same organization.
For MEEA*[3], our comparison results are as follows:
| | pass@1 | N=50 | N=100 | N=200 | N=300 | N=400 | N=500 | Shorter routes |
|---|---|---|---|---|---|---|---|---|
| MEEA* | 44.74 | 62.11 | 73.16 | 81.58 | 83.68 | 84.74 | 86.32 | 50 |
| Retro-R1 | 55.79 | 73.21 | 77.21 | 82.58 | 84.47 | 85.89 | 86.95 | 72.3 |
Dear Reviewer,
Thank you again for your constructive feedback on our manuscript.
As the discussion deadline is approaching, we are keen to ensure that we have fully addressed all your concerns.
We would be grateful if you could let us know whether our explanations and experiments were sufficient to resolve the issues you identified. We are happy to provide more details if needed.
Thank you again for your time and engagement.
Best Wishes,
Authors
I thank the authors for the detailed responses. I have some fundamental concerns:
- The method is bottlenecked by a single-step model
- Evaluation still remains limited by the Retro-190 molecules set as well.
- The reasoning trace provided doesn't seem very sensical, and also hallucinates functional groups
- It thus doesn't seem clear why one would use an LLM over a traditional RL method, at least through the method the authors have shown.
- I don't agree with the authors' point that k = 50 provides no meaningful constraint, these template-based models tend to rank tens if not hundreds of thousands of templates
I do not believe that this work provides a very significant advancement in computer-aided retrosynthesis for domain scientists. However, the empirical results still hold some merit, and the methodology is one that seems to me to be important to try. As a result, I will retain my score of 4.
Thank you very much for your valuable comments. We will certainly prioritize them in our future work.
Specifically, enabling the large language model to perform single-step retrosynthesis directly, replacing the specialized single-step model, and incorporating expert-annotated chains of thought to make the reasoning process more chemically plausible are excellent next steps. Addressing these points will make our work more valuable to chemists.
We are very grateful for your positive assessment of our current methods and results, and we sincerely appreciate the supportive rating you have given our work.
This paper develops and evaluates an approach to RL training for LLM-guided retrosynthesis. The approach involves a loop in which the agent has a set of molecules it needs to synthesize (at first this is just the single target molecule), it chooses one of the molecules to synthesize, and it calls an off-the-shelf single-step retrosynthesis tool to propose possible single-step reactions for synthesizing the chosen molecule. The LLM then selects one of these reaction pathways, at which point the chosen molecule is removed from the set of unresolved molecules to synthesize and any reactants are added to the set (unless they belong to the set of basic initially-available reactants). The LLM can't backtrack, but it does perform a series of these linear searches, and whenever the same choice is made from the same intermediate state, the old cached tool-call results are reused. The series of partly overlapping linear search runs is combined to form a synthesis graph – essentially showing the part of the retrosynthesis search tree that was explored across all the runs. The shortest path from the initial target-molecule state to the state where there are no remaining intermediate molecules to be synthesized is then extracted. The paper evaluates against a number of deep learning baselines as well as ablations that just perform SFT instead of RL, and achieves a decent boost in performance, especially on an out-of-domain test set.
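For concreteness, that loop can be sketched roughly as follows (a minimal illustration under the assumptions in the summary above; names such as `single_step_tool` and `llm_choose_molecule` are hypothetical placeholders, not the paper's actual API):

```python
# Rough sketch of one linear search episode as summarized above.
# All names here are hypothetical; this is not the authors' implementation.

def run_episode(target, available, single_step_tool, llm_choose_molecule,
                llm_choose_reaction, cache, max_steps=30):
    unresolved = {target}              # molecules still to be synthesized
    trajectory = []                    # (state, chosen molecule, chosen reaction)
    for _ in range(max_steps):
        if not unresolved:
            return trajectory, True    # all molecules resolved: route found
        mol = llm_choose_molecule(unresolved)
        if mol not in cache:           # reuse cached tool calls across runs
            cache[mol] = single_step_tool(mol)
        rxn = llm_choose_reaction(mol, cache[mol])
        trajectory.append((frozenset(unresolved), mol, rxn))
        unresolved.remove(mol)
        unresolved |= {r for r in rxn["reactants"] if r not in available}
    return trajectory, False           # search budget exhausted
```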
Strengths and Weaknesses
Overall I believe this is an interesting paper with a novel contribution to an important area. The results of the method are fairly strong, and while I’ve listed a number of weaknesses none are particularly major. I would support this work being published in NeurIPS.
Strengths
- The core idea – RL based training for LLM based retrosynthesis – is to my knowledge novel, and they've executed it well. The paper is overall clearly written and understandable.
- The adaptation of RL for LLMs operating on multi-round tool calls has also been done in contemporaneous work, but I don't believe that interferes with the novelty claims here, so this is also a nice contribution (see Weaknesses for a note about citation).
- The results are fairly strong, and the evaluation is thorough enough that I do feel convinced by them. The proposed method achieves higher pass@1 than all the other methods. When scaling to higher iterations, it performs a little worse than PDVN, but despite this it still finds more synthesis pathways that are shorter than the reference solution than PDVN does. And when moving to out-of-distribution data, they also outperform PDVN – this makes a fair bit of sense to me; I would expect pretrained LLMs to be a bit more robust to distribution shifts than other deep learning methods.
- The ablations also seem appropriate. Ablating the use of RL away in particular was certainly an important setting and shows that RL is key to their good performance.
Weaknesses
- I was surprised that I didn't see any LLM-based retrosynthesis work cited in the related work. On looking around a bit I found for example "Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning" (Liu et al 2024) and possibly "BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction" (Yang 2024). I'm not familiar enough with the area to say if these are the right citations, but certainly I'd expect some LLM-based retrosynthesis in the related work.
- There's been some other work that has done LLM based multi-round tool-calls with RL, which I believe it is recent enough that it just counts as contemporaneous work that should be cited (per NeurIPS 2025 call for papers) but doesn't interfere with the claims of this work. In particular "ReTool: Reinforcement Learning for Strategic Tool Use in LLMs" (April 15 2025), which appears to also similarly mask tool outputs (section 2.3.2).
- The graph-based shortest path approach is a reasonable algorithm, but the language around it feels a little excessively flowery in the paper, such as the line "effectively mimicking a chemist's comprehensive exploration process" (in the 3rd contribution from the intro), as well as in section 3.4. Additionally, my understanding (feel free to correct me of course) is that Retro* also uses a shortest path algorithm (section 3.5 of the Retro* paper) to extract a solution from its search graph. It's listed in the contributions of this paper (Retro-R1), which I think is fine, but I would just view it as a smaller contribution than the others.
- The approach doesn't seem to use the AND-OR tree of Retro* (among other work) — why not? Instead each run of search here produces a linear chain where each node has a set of molecules rather than a single molecule. In the AND-OR tree each chosen reaction produces an AND of the different reactant molecules that are needed, which can each have their own retrosynthesis search subtrees. That seems like it would allow for more sharing than this setup, and the benefit of not doing that is not clear to me. In the current setup I assume if one branch of search had {A, B} as a set of target molecules and another had {B, C} these couldn't be merged into the same node, whereas in the AND-OR tree each reactant gets its own node so both B nodes could be the same. This would then allow for more possible paths when running the shortest path algorithm. Correct me if I'm misunderstanding though!
- There are no repetitions / error bars on the baseline methods in table 1. This isn't a huge deal to me but would have been nice.
Quick fixes and suggestions:
- Figure 2 the font size and indents could be a bit smaller, because the amount of text wrapping that's happening makes it difficult to read.
- Line 323 it'd be helpful to say what "wos" means in the main text too, instead of just in the figure caption.
- While overall I didn't have issues with clarity/readability, the very first paragraph could use some revision – some suggestions are: removing the comma in the first sentence; the second sentence is too short; the first sentence is a little confusing as it doesn't evoke how retrosynthesis is about working backwards; there's a typo "highe-order".
- Figure 1 is a bit hard to read. It's difficult to know where to start in the figure (eventually I realized the start is at "Target mol1"). Visually making this clearer could help. The label "Training" with a dotted red line box is a bit confusing, as doesn't training also include things outside the box too? It seems like the red box is actually meant to single out one round (Round 0).
- While the citation that's present is helpful, lines 216-218 should more explicitly state that the multistep synthesis routes are from the prior work that's cited, rather than saying the routes "are constructed".
Questions
- See questions in “Weaknesses”
- How does your work relate to the 3 pieces of related work mentioned above?
- I was still a little uncertain about what Figure 4 is showing, even after reading the text and the caption, but I think I've figured it out. Is it that if you imposed the route length limit given by the x axis, you would get the success rate on the y axis?
Limitations
Yes.
Final Justification
My issues were resolved in the rebuttal – particularly those around citing related work which I found most important. While I agreed with issues raised by other reviewers, reading their rebuttals and discussions largely resolved my concern with those as well.
Formatting Issues
None.
Question1: LLM-based retrosynthesis works are not cited in the related work. "Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning" (Liu et al 2024) and "BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction" (Yang 2024).
Response: We thank the reviewer for pointing out these works. Recent domain-specific chemistry large language models (LLMs) are primarily trained with supervised fine-tuning (SFT) and are typically evaluated across a variety of tasks, often benchmarked against general-purpose LLMs. However, they generally do not surpass specialized methods in performance.
Regarding Liu et al. (2024), this work explores both single-step and multi-step retrosynthesis. For multi-step retrosynthesis, the LLM is primarily used as a scoring function integrated into an A* search algorithm. The original paper does not include comparisons with specialized models, and crucially, the code for their synthesis planning is not open-sourced, making direct comparison difficult.
Yang (2024) focuses solely on single-step retrosynthesis. While their model could potentially serve as an interchangeable single-step component within our framework, its performance has not been shown to exceed that of existing specialized retrosynthesis models.
We agree that these and other relevant LLM-based works should be cited to provide a comprehensive overview of the field. We will revise our manuscript to include these references and discuss their relevance within the related work section.
Question2: Other works that have done LLM based multi-round tool-calls with RL are not cited. In particular "ReTool: Reinforcement Learning for Strategic Tool Use in LLMs" (April 15 2025), which appears to also similarly mask tool outputs.
Response: We appreciate the reviewer highlighting the relevant work "ReTool: Reinforcement Learning for Strategic Tool Use in LLMs" (April 15, 2025). This is indeed an excellent paper: it demonstrates the significant potential of multi-round tool-call reinforcement learning (RL), particularly for solving mathematical problems by equipping large language models with code-execution tools. We will certainly cite it in our revised manuscript.
We have noted the similarities in the rollout-process design between ReTool and our own work, where models interact with the environment using specific tags. ReTool also employs an Interpreter Feedback Mask to prevent interpreter tokens from contributing to the loss calculation, which aligns with our approach to handling tool outputs. However, ReTool doesn't explicitly introduce new RL algorithms; for example, details regarding the calculation of Generalized Advantage Estimation (GAE) or the specific design of the value-function loss aren't thoroughly discussed.
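To illustrate the shared masking idea in general terms (a generic sketch, not the exact scheme used by ReTool or by Retro-R1; `token_loss` and `is_tool_output` are hypothetical names):

```python
import torch

# Generic sketch: exclude tool/interpreter output tokens from the policy loss,
# so only model-generated tokens contribute to the gradient.
def masked_policy_loss(token_loss: torch.Tensor, is_tool_output: torch.Tensor):
    keep = (~is_tool_output).float()   # 1 for model-generated tokens, 0 for tool output
    return (token_loss * keep).sum() / keep.sum().clamp(min=1.0)
```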
Question3: The graph-based shortest path approach is a reasonable algorithm, but the language around it feels a little excessively flowery in the paper (in the 3rd contribution from the intro, as well as in section 3.4). Retro* uses a shortest path algorithm to extract a solution from their search graph, which is a smaller contribution than the others.
Response: Thank you very much for raising this point. The primary purpose of the 3rd contribution is to design a memory mechanism that reuses existing information to make full use of the budget when constructing the synthesis graph. Subsequently, a breadth-first search is employed within this synthesis graph to identify the shortest path.
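For reference, the shortest-route extraction described above amounts to a standard breadth-first search over the (unweighted) synthesis graph; the sketch below uses hypothetical names and a simplified data layout, not our actual code:

```python
from collections import deque

# Minimal BFS sketch. Nodes are search states (e.g., frozensets of unresolved
# molecules), `neighbors[state]` lists states reachable by applying one reaction,
# and the goal is the empty state (no unresolved molecules left).
def shortest_route(start, neighbors):
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if not state:                          # goal reached
            route = [state]
            while parent[route[-1]] is not None:
                route.append(parent[route[-1]])
            return list(reversed(route))
        for nxt in neighbors.get(state, []):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None                                # no route found in this graph
```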
We observed that human experts, when planning synthesis routes, also analyze existing exploration history to draw new paths in the synthesis graph. This behavior is closely mirrored by our design, hence the description "mimicking a chemist’s comprehensive exploration process." We believe this represents a crucial step in imbuing LLM agents with a chemist's mindset, which is vital for creating truly autonomous chemist agents.
Question4: The approach doesn't seem to use the AND-OR tree of Retro* (among other work). In the current setup if one branch of search had {A, B} as a set of target molecules and another had {B,C} these couldn't be merged into the same node, whereas in the AND-OR tree each reactant gets its own node so both B nodes could be the same. This would then allow for more possible paths when running the shortest path algorithm.
Response: Thank you for this valuable suggestion. We did explore using an AND-OR tree in our early attempts. However, to enable the large language model (LLM) to understand the current synthetic state, we found it necessary to convert the AND-OR tree into natural language. The challenge arose when the number of nodes became substantial; clearly describing all relationships between these nodes resulted in an extremely long context, significantly impacting the LLM's performance. This is why we opted for the more concise super-node tree.
Our current design incorporates the spirit of the AND-OR tree. For example, in your provided case of nodes {A, B} and {B, C}, the environment internally stores molecules A, B, and C as individual entities. The memory mechanism helps avoid re-invoking the single-step model for repeated queries. We have not yet fully leveraged this granular information because the LLM is unaware of it.
Your suggestion for improving the efficiency of our search algorithm by more explicitly incorporating the benefits of the AND-OR tree structure is a very important direction for our future work. We appreciate you highlighting this potential enhancement.
Question5: There are no repetitions / error bars on the baseline methods in table 1.
Response: Thank you for pointing out the absence of repetitions and error bars for the baseline methods in Table 1.
The reason for this is that the baseline methods employ deterministic search algorithms. Once the model weights are fixed, repeated runs will yield identical results. Since we primarily utilized the models open-sourced by the original baseline papers, and did not retrain them multiple times ourselves, we are unable to provide variance for these baselines.
Question6: I was still a little uncertain about what Figure 4 is showing, even after reading the text and the caption, but I think I've figured it out. Is it that if you imposed the route length limit given by the x axis, you would get the success rate on the y axis?
Response: Your understanding is absolutely correct. Figure 4 aims to demonstrate that under the same route length constraints, our model achieves a higher success rate. Furthermore, our model possesses the capability to find very long routes for challenging molecules. In contrast, baseline methods tend to explore broadly during their search, which limits their ability to delve into deep routes.
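In other words, the curve in Figure 4 can be read as the output of a computation like the following (illustrative only; `route_lengths` is a hypothetical list holding the best found route length per target, or `None` when no route was found):

```python
# For each length limit L on the x axis, the y value is the fraction of targets
# whose best found route is no longer than L.
def success_rate_vs_limit(route_lengths, limits=(5, 10, 15, 20, 25, 30)):
    n = len(route_lengths)
    return {L: sum(r is not None and r <= L for r in route_lengths) / n
            for L in limits}
```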
Question7: Quick fixes and suggestions: Figure 2 the font size and indents could be a bit smaller. Line 323 it'd be helpful to say what "wos" means in the main text. The very first paragraph could use some revision. Figure 1 is a bit hard to read. Lines 216-218 should more explicitly state that the multistep synthesis routes are from the prior work that's cited.
Response: Thank you very much for these actionable suggestions. We agree that all the points you've raised are important, and we will address them in the revised version of the manuscript:
Figure 1: We will redraw it to be clearer. Your understanding is absolutely correct: the dotted red box marks one trajectory, and in reinforcement learning training each trajectory contains only one round.
Figure 2: We will optimize the font size and indents in Figure 2 for better readability.
First Paragraph: We will revise the introductory paragraph to improve its clarity and precision.
Lines 216-218: We will make it more explicit that the multi-step synthesis routes are from the cited prior work. We greatly appreciate these valuable comments, as they are crucial for enhancing the quality of our paper.
Line 323 "wos": We will explicitly define "wos", without shuffle, in the main text.
Thank you for the rebuttal.
Q1, Q2: Thank you for these discussions and for agreeing to include these papers and/or others in the related work.
Q3: I think this is a good step to have in the algorithm but still don't particularly believe the flowery language is needed, but I won't push too hard against it. And as mentioned in my review, since this shortest path approach is also used in prior work like Retro* I still view it as minor among the contributions.
Q4: Got it, so you cache for individual molecules but don't get the benefit of more possible paths as in the {A, B} {B, C} example. As a suggestion for future work, I wonder if you could actually just split the supernode graph into an AND-OR graph as a post-processing step right before running the shortest path algorithm. After all, you know which molecule in the supernode is affected by each reaction. I think you could reconstruct the unique AND-OR graph that corresponds to a given supernode graph.
Q5, Q6, Q7: Makes sense and sounds good, thank you.
Thank you for your comments. We are glad that our previous response has addressed most of your concerns. Regarding your remaining points, our detailed responses are as follows:
Question1: I think this is a good step to have in the algorithm but still don't particularly believe the flowery language is needed, but I won't push too hard against it. And as mentioned in my review, since this shortest path approach is also used in prior work like Retro* I still view it as minor among the contributions.
Thank you for this valuable feedback. We agree that our contributions should be presented as clearly and directly as possible. We will revise the manuscript to remove the flowery language and to present our claims more directly.
Question2: As a suggestion for future work, I wonder if you could actually just split the supernode graph into an AND-OR graph as a post processing step right before running the shortest path algorithm. After all you know which molecule in the supernode is affected by each reaction. I think you could reconstruct the unique AND-OR graph that corresponds to a given supernode graph.
Thank you for this excellent suggestion. We have followed your advice and implemented a post-processing step to convert our synthesis graph into an AND-OR graph before running the shortest path algorithm.
This AND-OR graph is distinct from the AND-OR trees used in works like Retro*. In our graph representation, each molecule appears only once (allowing for multiple parents), and the structure permits cycles. We have implemented the shortest path algorithm on this AND-OR graph, and the results are as follows:
Number of shorter routes compared to the ground truth routes on Retro-190:
| | Shorter routes |
|---|---|
| Retro-R1-SuperNode | 72.3 |
| Retro-R1-AndOr | 87 |
Success rate with respect to route length limit on Chembl-1000:
| | 5 | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|
| Retro-R1-SuperNode | 65.4 | 77.9 | 81.2 | 82.5 | 83.2 | 83.7 |
| Retro-R1-AndOr | 67.4 | 79.4 | 82.7 | 83.4 | 83.9 | 83.9 |
For future work, we plan to explore enabling large models to directly understand the AND-OR graph structure, which we believe could further enhance performance.
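As a concrete illustration of the conversion described above, assuming each edge of the synthesis graph records which molecule was expanded, by which reaction, and into which reactants (all names are hypothetical and simplified relative to our actual implementation):

```python
# Sketch: split the super-node synthesis graph into an AND-OR graph.
# OR nodes are individual molecules; AND nodes are reactions whose children are
# the reactant molecules. Because molecules are keyed individually (e.g., by
# canonical SMILES), supernodes such as {A, B} and {B, C} share the node for B.
def to_and_or_graph(reaction_edges):
    # reaction_edges: iterable of (expanded_molecule, reaction_id, reactants)
    or_children = {}     # molecule -> set of applicable reaction_ids
    and_children = {}    # reaction_id -> list of reactant molecules
    for mol, rxn_id, reactants in reaction_edges:
        or_children.setdefault(mol, set()).add(rxn_id)
        and_children[rxn_id] = list(reactants)
    return or_children, and_children
```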
We hope our responses have fully addressed all your concerns. Thank you once again for your constructive suggestions, which have helped us improve our work significantly.
Wow, I'm impressed that you implemented that – nice to see it gives a little benefit!
I don't have any more questions – given the promises to broaden related work and other changes, I've raised my score.
Thank you so much for your support and for deciding to raise your rating! We truly appreciate it and are grateful for your thoughtful review and constructive feedback.
This paper proposes RETRO-R1, a framework that conceptualizes multi-step retrosynthesis as a problem of sequential decision-making under a partially observable environment. An LLM-based agent is optimized through reinforcement learning to perform iterative decomposition, selecting molecules and invoking external tools to predict reactions. The method capitalizes on the pre-trained reasoning and tool-use capacities of LLMs, demonstrating robust and competitive performance on canonical retrosynthesis benchmarks.
Suggested revisions:
- Clarify evaluation metrics (e.g., emphasize route diversity).
- Expand discussion on computational trade-offs (e.g., cost vs. interpretability).
- Cite the LLM-based retrosynthesis works to contextualize contributions.