LLMs Can Reason Faster Only If We Let Them
Enabling faster large language model solutions for autonomous reasoning and planning
Abstract
Reviews and Discussion
This work addresses the intermediate-step solution length of Chain-of-Thought-style exploration algorithms for planning problems, using supervised fine-tuning of LLMs on Algorithm of Thoughts (AoT)-style plans and a reinforcement learning mechanism that rewards less verbose plans, thereby reducing solution length. The experimental evaluation with three small-size LLMs (1-3B) on three different planning game domains indicates that solution lengths are reduced quite significantly, while the accuracy of solved problems is maintained or improved.
Update after Rebuttal:
The authors' further experiments on IPC domains such as Blocksworld and Logistics, and experiments utilizing fewer examples for the SFT & RL baselines, are appreciated. Although there may be differences in the newly obtained results in terms of accuracies and solution-length steps, especially with the SFT baselines, I believe the performance of the proposed method stands up to the scaling attempts, and the conclusions should largely hold without much perturbation. Other questions have been answered, and clarity/wording recommendations have been accepted. Thus, my score has been increased from a Reject to a Weak Accept.
Questions to Authors
- I request the authors to indicate the complexity distribution in the test set for each of the domains. For example, in N-Puzzle, in the 100 examples of the test set, what is the distribution of the edit distance that is claimed to range from 15 to 30? I assume the Manhattan distance mentioned is equivalent to the planning edit distance. Similarly, for Word Ladder, what is the distribution of solution path lengths in the test set?
- Why is the standard Tree-of-Thoughts baseline without fine-tuning not included? In my understanding, ToT may have a more systematic exploration of the state space, with possibly shorter solution lengths.
- In Section 3.2, Figure 2, what does the Y-axis represent? Why does AoT-medium hit 4 while the other two settings only hit 3?
Claims and Evidence
- The claim that SFT on AoT-style plans makes the LLM learn to systematically explore the state space is an oversell, if not completely wrong. There is no reliable evidence that SFT does anything to help systematic exploration of the state space. It is well known that SFT improves token-level inference and approximate retrieval for problem types similar to those encountered during fine-tuning. So, in my understanding, all SFT does is make the exploration less random than before.
- To be clear, the proposed RL stage with step-count penalties for correct solutions, intended to reward concise solutions with fewer intermediate (thinking) steps, only provides a weak reward signal for conciseness (i.e., it applies optimization pressure, as alluded to in certain parts of the paper) rather than directly optimizing for it as an objective.
- The Word-Ladder domain is claimed to provide additional complexity for LLMs because actions must be derived from the LLMs' prior knowledge of valid English words. In my perspective, this is not true. If anything, I believe this domain is an advantage for LLMs. It's not clear why this domain would be more complex for LLMs compared to standard planning domains.
Methods and Evaluation Criteria
- The training and test sets have a disproportionate number of samples, which opens the possibility of a biased evaluation, as the test set may not be fully representative of the data distribution. It is unclear what the distribution of problem complexity in the test set is.
- The domains Game of X and Word Ladder are relatively less sophisticated planning domains compared to standard planning domains such as Blocksworld, Grid, Logistics, etc. It is unclear whether this method can be evaluated on such domains, as the number of training examples and the AoT-style plan creation needed for the SFT stage are quite high and effort-intensive. However, I appreciate the work for using the N-Puzzle domain, which is known to be quite complex in planning, as the problems can contain negative interactions between subgoals.
Theoretical Claims
N/A
Experimental Design and Analysis
- Section 3.2 is largely unclear. I do not understand the core motivation and aim of that experiment. The graph in Figure 2 is also unclear.
Supplementary Material
I have reviewed the supplementary material sections A and B.
Relation to Existing Literature
- This work tries to address the high solution length problem seen in the Algorithm of Thoughts work. The work's direction could be useful for creating better LLM-based plan generators as it improves the solution length and accuracy metrics for basic planning problems.
- The work's direction could also be useful in informing the community involved in post-training techniques for Reasoning Language Models (RLMs) such as O1 and DeepSeek to obtain data with shorter solution lengths for post-training LLMs.
Essential References Not Discussed
- References in Section 1 for the LLM-Modulo framework are incorrectly specified.
Other Strengths and Weaknesses
Other Strengths:
- The paper is largely clear.
Other Weaknesses:
- In my understanding, the method is not scalable. In my view, with the SFT technique being heavily involved in the proposed method, there is an inherent difficulty in including standard planning domains such as Blocksworld, Grid, Logistics, and others used in the International Planning Competition, since the SFT stage with AoT-style plans utilizes a large number of examples. For the standard planning domains, it could be harder to create sufficient examples for SFT, only to then have the RL stage weakly influence the LLM to generate shorter solution steps. Thus, the proposed method is not scalable to real-world domains in my opinion.
- As the RL-stage reward formulation only applies optimization pressure by providing a weak reward signal for conciseness, there is no guarantee that there will not be significant backtracking even if the solution length is shorter than that of CoT plans, since there is no way to keep track of open and closed lists as in standard search algorithms such as A* search. Hence, the optimization pressure could increase action hallucinations and other types of errors from the LLM.
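For reference, the open/closed-list bookkeeping I am contrasting with is the standard one below (a purely illustrative Python sketch; the callback signatures are hypothetical and not tied to the paper):

```python
import heapq
import itertools

def a_star(start, is_goal, successors, heuristic):
    """Textbook A*: `successors(s)` yields (next_state, step_cost) pairs and
    `heuristic(s)` estimates cost-to-go. The open list (priority queue) and
    closed set are exactly the state bookkeeping an LLM-generated trace lacks."""
    counter = itertools.count()                    # tie-breaker so the heap never compares states
    open_heap = [(heuristic(start), next(counter), start)]   # the "open list"
    g_cost = {start: 0.0}
    parents = {}
    closed = set()                                 # the "closed list"
    while open_heap:
        _, _, state = heapq.heappop(open_heap)
        if state in closed:
            continue                               # stale entry; already expanded via a cheaper path
        if is_goal(state):
            path = [state]
            while state in parents:                # reconstruct the plan back to the start
                state = parents[state]
                path.append(state)
            return path[::-1]
        closed.add(state)
        for nxt, step_cost in successors(state):
            new_g = g_cost[state] + step_cost
            if nxt not in closed and new_g < g_cost.get(nxt, float("inf")):
                g_cost[nxt] = new_g
                parents[nxt] = state
                heapq.heappush(open_heap, (new_g + heuristic(nxt), next(counter), nxt))
    return []                                      # no plan found
```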
Other Comments or Suggestions
Typos:
- Section 4.2, Line 4.
Suggestions:
- Section 3.2 needs to be re-written or made clear. Figure 2 needs to be explained more thoroughly.
- Section 4.1 thinking component output examples need to be made more clear, including explaining all the relevant symbols.
We thank the reviewer for their extremely detailed review. We have acted upon their suggestions to add new experimental results and provide clarifications.
Response to Claims and Evidence
- In Table 2, we see that AoT-SFT performs much better than CoT-SFT, indicating better exploration. Furthermore, we chose the word "systematically" to highlight the deliberate search in the output of the LLMs, as opposed to the within-layer exploration that happens with CoT. However, we will change that sentence according to the reviewer's suggestions to prevent any misunderstandings for future readers.
- The RL stage first incentivizes correct solutions and provides a conciseness reward only if the solution is correct, so it can be considered a "weak" reward signal in RL terminology. We agree with the reviewer and will update this part. However, we would also like to add that this way of providing reward to the LLMs is much more realistic, since we do not assume any heuristic knowledge that would give much more fine-grained feedback.
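For illustration, a minimal sketch of one possible gated formulation (the functional form and coefficient are illustrative only and may differ from the exact reward used in the paper):

```python
def gated_reward(is_correct: bool, num_steps: int, step_penalty: float = 0.01) -> float:
    """Hypothetical gated reward: correctness comes first, and a conciseness
    bonus is granted only once the solution is already correct."""
    if not is_correct:
        return 0.0                         # incorrect solutions earn no conciseness credit
    return 1.0 - step_penalty * num_steps  # shorter correct solutions score higher
```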
- In Table 2, CoT-SFT performed the worst on Word-Ladder. Furthermore, LLMs can sometimes have difficulty identifying individual letters in words due to tokenizers. We will update the paper to include these discussions.
Response to scalability concerns
Originally, we opted to use a high number of training examples to show that the improvements we were getting with AoT-O3 were not due to the SFT models being under-trained. However, this raised concerns from the reviewers, so we repeated the experiments with a reduced number of SFT and RL examples. For all three tasks (Game of X, N-Puzzle, Word-Ladder), we used 320 examples to supervise-finetune the models (batch size = 64, epochs = 10, training steps = 50, lr = 2e-4). For the AoT-SFT model, we generated new random augmentations of the same 320 examples at each epoch. For RL, we used 160 examples (batch size = 32, epochs = 10, lr = 2e-5). For ToT, we used a breadth of 5. We evaluated on 1000 test examples.
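Schematically, the reduced-data setup described above is as follows (variable names are illustrative and not an excerpt of our training code):

```python
# Illustrative summary of the reduced-data experiments
# (per task: Game of X, N-Puzzle, Word-Ladder); names are hypothetical.
sft_config = {
    "num_examples": 320,     # AoT-SFT re-augments these 320 examples each epoch
    "batch_size": 64,
    "epochs": 10,
    "training_steps": 50,
    "learning_rate": 2e-4,
}
rl_config = {
    "num_examples": 160,
    "batch_size": 32,
    "epochs": 10,
    "learning_rate": 2e-5,
}
tot_breadth = 5              # breadth used for the ToT baseline
num_test_examples = 1000
```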
| Problem | Method | Accuracy (%) | | | Solution Length (steps) | | |
|---|---|---|---|---|---|---|---|
| | | Gemma2-2B | Llama3-1B | Llama3-3B | Gemma2-2B | Llama3-1B | Llama3-3B |
| Game of X | ToT | 26.4 | 20.4 | 27.1 | 18.2 | 17.5 | 18.3 |
| | CoT-SFT | 30.9 | 22.3 | 30.7 | 4 | 4 | 4 |
| | AoT-SFT | 58.7 | 53.1 | 62.9 | 33.2 | 30.8 | 36.9 |
| | AoT-O3 | 76.6 | 71.2 | 79.4 | 17.9 | 9.7 | 11.5 |
| N-Puzzle | ToT | 29 | 30 | 33 | 80.0 | 84.9 | 63.9 |
| | CoT-SFT | 32.3 | 33.7 | 40.3 | 7.6 | 7.7 | 9.2 |
| | AoT-SFT | 62.1 | 58.7 | 66.3 | 26.7 | 26.6 | 32.7 |
| | AoT-O3 | 72.2 | 66.1 | 75.7 | 17.4 | 16.7 | 14.1 |
| W-Ladder | ToT | 12.5 | 9.2 | 17.8 | 38.5 | 36.0 | 32.0 |
| | CoT-SFT | 10.8 | 5.0 | 13.4 | 10.1 | 6.8 | 9.4 |
| | AoT-SFT | 58.4 | 48.8 | 62.1 | 29.5 | 28.5 | 34.1 |
| | AoT-O3 | 62.3 | 52.6 | 67.3 | 18.7 | 13.2 | 16.2 |
Response to weak reward signal concerns
To alleviate concerns about the reward signal, we experimented on the domains suggested by the reviewer (BlocksWorld and Logistics) without any SFT stage, simply applying the RL stage (with the same setup as in the previous results) to models prompted with five in-context examples. We followed the training and testing settings in [1], and for ToT we again used a breadth of 5. We used the recently released Gemma3 4B IT model for its good performance and relatively small size, and because we believe it will be the relevant model to use for the remainder of 2025.
| Problem | Method | Accuracy (%) | Solution Length (steps) |
|---|---|---|---|
| BlocksWorld | CoT | 27.2 | 6.3 |
| | ToT | 56.3 | 45.2 |
| | AoT | 59.1 | 24.2 |
| | AoT-O3 | 74.7 | 11.3 |
| Logistics | CoT | 18.5 | 12.6 |
| | ToT | 51.5 | 70.8 |
| | AoT | 49.3 | 55.9 |
| | AoT-O3 | 71.9 | 24.8 |
[1] https://arxiv.org/pdf/2501.13545
Response to Questions
- Thank you for asking for this clarification. For N-Puzzle, after shuffling randomly, we selected examples with almost uniform Manhattan Distances (MD) ranging from 15 to 30 (in the original 100 test examples, 10 examples for MD=15, and 6 examples each for MD=16, ..., 30). We followed the same logic for Word-Ladder, this time with solution path lengths ranging from 4 to 10 steps.
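For clarity, the Manhattan distance used for this binning can be computed as in the following minimal sketch (the flat-tuple encoding and function name are illustrative, not our exact preprocessing code):

```python
def manhattan_distance(state, goal, width=3):
    """Sum of per-tile Manhattan distances between `state` and `goal`.

    Both are flat tuples with 0 as the blank, e.g. the solved 8-puzzle
    (width=3) is (1, 2, 3, 4, 5, 6, 7, 8, 0)."""
    goal_pos = {tile: divmod(i, width) for i, tile in enumerate(goal)}
    total = 0
    for i, tile in enumerate(state):
        if tile == 0:                    # the blank is conventionally ignored
            continue
        row, col = divmod(i, width)
        goal_row, goal_col = goal_pos[tile]
        total += abs(row - goal_row) + abs(col - goal_col)
    return total
```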
- We have included ToT without fine-tuning. However, it is not completely correct that ToT performs systematic exploration, since the explored paths appear disjointly in the context, which can lead to visiting the same states at separate time steps.
- Figure 2 is a histogram, where the Y-axis is the number of correct instances. We will update Section 3.2 in the camera-ready version to prevent confusion.
Conclusion
We thank you in advance for checking our rebuttal and reconsidering your score if you believe our explanations and the introduction of new results alleviated your concerns. If you have further questions, as far as we understand from the new ICML guidelines, you can only respond one more time (review → rebuttal → reviewer reply → author reply). Therefore, we kindly request you to be precise about your remaining concerns in your response.
Thank you for the updates on the new results. My primary concerns have been largely addressed or agreed to be addressed. Thus, I have updated my score. However, I remain unconvinced about the potential of the whole direction in contrast to external verification.
We thank the reviewer for increasing their score in favor of acceptance of our work. As agreed, all new results will be integrated into the camera-ready version. Algorithm of Thoughts (AoT) [1] has been shown to be competitive with or more performant than ToT, and AoT+ [2] has shown that even further improvements can be made to match or surpass external verification methods such as LLM-Modulo, all while being more efficient in terms of token usage and the number of queries required to get an answer. In this work, we show that AoT-O3 with reinforcement learning can be used to further improve performance and efficiency (by up to 80%). We show that even with relatively small LLMs, AoT-O3 reduces the need for external verification, which makes it more practical and scalable to new problems.
We sincerely extend our gratitude to the reviewer for taking the time to really delve into our work for their review. Their feedback has helped improve the clarity of the paper significantly.
[1] Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models (ICML 2024) https://openreview.net/forum?id=KJL2b6BthC
[2] LLMs Can Plan Only If We Tell Them (ICLR 2025), https://openreview.net/forum?id=K3KrOsR6y9
This paper introduces AoT-O3, a framework that combines supervised fine-tuning and reinforcement learning to enhance the planning efficiency of large language models (LLMs). By using a handcrafted reward function that optimizes both solution validity and length, AoT-O3 reduces reasoning steps while maintaining or improving accuracy. The framework is evaluated on three planning benchmarks using three different-sized models, demonstrating its good performance.
Update after Rebuttal:
The method proposed in this paper demonstrates strong performance across multiple benchmarks and shows good generalization to multiple LLMs. Furthermore, the writing is clear and well-structured. Based on these strengths, I believe the paper is worthy of acceptance.
Questions to Authors
See the weaknesses.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
No, there is no theoretical claim.
Experimental Design and Analysis
Yes.
Supplementary Material
Yes, I reviewed the entire supplementary material.
Relation to Existing Literature
A simple yet effective handcrafted reward function can help LLMs reduce reasoning steps while maintaining or improving planning accuracy.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
Strengths:
- The proposed handcrafted reward function, which balances solution validity and length efficiency, provides clear and direct feedback to the model without the need for complex learned reward models.
- The proposed method demonstrates promising results in achieving higher planning accuracy and shorter solution lengths.
Weaknesses:
- The paper lacks an experimental comparison with a method that directly applies RL to the model without SFT.
- The paper mentions an alternative to the reward model (i.e., Equation (4)), but its effectiveness is not evaluated in the experiments. Including this evaluation would make the study more solid.
- The paper lacks qualitative examples for analysis and demonstration.
- A discussion on the limitations of the proposed method is missing.
Other Comments or Suggestions
On page 7, line 345, the statement "We evaluate AoT-O3 across two challenging planning benchmarks: Game of X and N-Puzzle." should be revised to indicate that there are three benchmarks, including Word Ladder.
We thank the reviewer for taking the time to review our paper. We are encouraged that our paper is recognized as "simple yet effective" and to demonstrate "promising results in achieving higher planning accuracy and shorter solution lengths".
Response to Weaknesses
- We experimented on the BlocksWorld and Logistics domains without any SFT stage, simply applying the RL stage directly (with the same setup as in the previous results) to models prompted with five in-context examples. We followed the training and testing settings in [1], and for ToT we again used a breadth of 5. We used the recently released Gemma3 4B IT model for its good performance and relatively small size, and because we believe it will be the relevant model to use for the remainder of 2025.
| Problem | Method | Accuracy (%) | Solution Length (steps) |
|---|---|---|---|
| BlocksWorld | CoT | 27.2 | 6.3 |
| | ToT | 56.3 | 45.2 |
| | AoT | 59.1 | 24.2 |
| | AoT-O3 | 74.7 | 11.3 |
| Logistics | CoT | 18.5 | 12.6 |
| | ToT | 51.5 | 70.8 |
| | AoT | 49.3 | 55.9 |
| | AoT-O3 | 71.9 | 24.8 |
[1] https://arxiv.org/pdf/2501.13545
- Thank you for pointing this out; our preliminary experiments with that reward model showed that it performed similarly. However, we will include those results in the appendix of the camera-ready version of our paper.
- We will be sure to include qualitative results in the camera-ready version of our paper.
- We will include limitations of our method in the camera-ready version of our paper.
Conclusion
We will also fix the typos mentioned in the camera-ready version. We thank you in advance for checking our rebuttal and reconsidering your score if you believe our explanations and the introduction of new results alleviated your concerns. If you have further questions, as far as we understand from the new ICML guidelines, you can only respond one more time (review → rebuttal → reviewer reply → author reply). Therefore, we kindly request you to be precise about your remaining concerns in your response. Also, please be sure to check our responses to other reviewers, where we have introduced new results utilizing fewer training examples and applying RL without the SFT stage to show the importance of AoT-O3, even in cases where we do not have the SFT stage.
Thanks for the authors' response. I keep my original score.
This paper proposes AoT-O3, an RL-based approach that reduces solution length in LLM-based planning while preserving or improving accuracy. It builds on the already popular Algorithm of Thoughts by adding a reward function. Experiments on 3 benchmarks show a reduction in reasoning steps and higher success rates compared to existing baselines.
Questions to Authors
Can you explain why current AoT implementations may be imitating rather than truly engaging in System 3 thinking?
Claims and Evidence
The claims made in this paper are coherent.
Methods and Evaluation Criteria
The authors demonstrate their approach on multiple diverse tasks (math-based puzzles, sliding tile puzzles, word-ladder transformations), suggesting good generalization potential.
Theoretical Claims
No major theoretical claims mentioned in the paper.
Experimental Design and Analysis
Experiments looked logical and sound to me.
Supplementary Material
I didn't feel the need to go through the supplemental material.
Relation to Existing Literature
The paper merges a cost sensitive RL alignment with Algorithm of Thoughts to improve the planning capabilities.
Essential References Not Discussed
They cite the relevant papers to the best of my knowledge.
Other Strengths and Weaknesses
Strengths:
- The paper is well written and easy to understand.
- This area of research is highly significant: boosting model accuracy while cutting power consumption and simplifying the training pipeline is essential.
- Unlike exploratory techniques like Tree of Thoughts, AoT-O3 operates within a single pass, making it more computationally viable and simpler to deploy.
Weaknesses:
- Their approach provides only incremental novelty, as it simply extends a standard RL alignment with a step-cost penalty, offering little beyond an otherwise well known method.
- The success of the length-aware reward depends upon careful hyperparameter tuning which might be non-trivial for new domains.
- Since the paper uses an RLHF design, it would make sense to include SFT with other RL training baselines like PPO.
Other Comments or Suggestions
Line 308/309 do not make much sense : "we with the model to generate solutions that are both correct and efficient"
We thank the reviewer for taking the time to review our paper. We are encouraged that our paper is recognized as "computationally viable" and "simpler to deploy". We have acted on the reviewer's suggestions to provide new experimental results and provide clarifications.
Response to Weaknesses
- Our whole approach includes the identification of unnecessary exploration in AoT and the repurposing of existing RL libraries to improve performance and efficiency. As you have also pointed out, "boosting model accuracy while cutting power consumption and simplifying the training pipeline is essential". Recognizing that simplicity is the ultimate sophistication, we see our whole approach as much more practical and easier to use.
- We have used the same hyperparameters in all five domains, including the newly added Blocksworld and Logistics, showing significant improvements in all the domains. We would like to add that these domains are highly different in terms of their states and actions, implying the generalizability of the hyperparameters we used. Moreover, we used the same hyperparameters in our new results with PPO, further showcasing their generalizability.
- Thank you for these suggestions. We chose RLOO for being more VRAM-friendly. However, we have added new results for Gemma2-2B and Llama-3.2-1B with PPO instead of RLOO.
| Problem | Method | Accuracy (%) | | | Solution Length (steps) | | |
|---|---|---|---|---|---|---|---|
| | | Gemma2-2B | Llama3-1B | Llama3-3B | Gemma2-2B | Llama3-1B | Llama3-3B |
| Game of X | AoT-O3 | 80 | 75 | 83 | 15.1 | 10.2 | 12.8 |
| N-Puzzle | AoT-O3 | 74 | 69 | 77 | 16.4 | 15.8 | 17.2 |
| W-Ladder | AoT-O3 | 61 | 51 | 66 | 19.5 | 14.4 | 15.7 |
Response to Questions
- Thank you for this highly relevant question about our comment in the introduction of our paper. As we have shown in Section 3.2, the solution lengths of AoT are highly affected by the length of the in-context examples. This hinted to us that shorter solutions were indeed possible, which led to our method. From our perspective, the fact that shorter solutions are possible indicates that the model unnecessarily explores for a significant amount of time before giving an answer. The RL stage reducing these total steps significantly shows us that the AoT-prompted or AoT-SFT models are imitating a significant part of the exploration rather than deliberately performing it.
Conclusion
We will also fix the typos mentioned in the camera-ready version. We thank you in advance for checking our rebuttal and reconsidering your score if you believe our explanations and the introduction of new results alleviated your concerns. If you have further questions, as far as we understand from the new ICML guidelines, you can only respond one more time (review → rebuttal → reviewer reply → author reply). Therefore, we kindly request you to be precise about your remaining concerns in your response. Also, please be sure to check our responses to other reviewers, where we have introduced new results utilizing fewer training examples and applying RL without the SFT stage to show the importance of AoT-O3, even in cases where we do not have the SFT stage.
The authors claim that while AoT has improved performance compared to CoT, it generally suffers from significantly longer solutions. To deal with this, they propose AoT-O3, which uses AoT-generated solutions to optimize models for both accuracy (supervised and RL) and solution length (RL). The proposed method, AoT-O3, shortens solution length by up to 80% compared to baseline AoT while maintaining or surpassing its performance.
Questions to Authors
- What is the y-axis in Figure 2?
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
There are no theoretical claims in the paper.
Experimental Design and Analysis
Yes, I have checked the soundness/validity of the experimental designs and analyses.
Supplementary Material
Yes, I reviewed the appendix.
Relation to Existing Literature
See the section below.
Essential References Not Discussed
- The paper shares many similarities with another branch of papers that learn to simplify reasoning steps. The authors should also mention/analyze/compare to those related works, for example, [1]. The example in Figure 1 especially looks like a skipped step.
[1] Liu, Tengxiao, et al. “Can language models learn to skip steps?.” arXiv preprint arXiv:2411.01855 (2024).
Other Strengths and Weaknesses
Strengths:
- The paper is very straightforward and easy to understand.
Weaknesses:
- The introduction parts seem to be a bit too long. The method section only starts from the 5th page.
- ‘In this section, we will use V for our reward models due to them actually being value models used in RL frameworks.’ -> this sentence looks a bit odd: you are not using V as the reward model; you are directly modeling the value function, and the reward function is a separate thing. All the parts of the paper that call V a reward model look a bit confusing.
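For reference, the standard distinction is that the reward r(s_t, a_t) is immediate feedback, while the value function is its expected discounted return under the policy,

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s\right],
```

so calling the value model a "reward model" conflates the two.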
Other Comments or Suggestions
- Table 2 is a bit hard to read. Can you also highlight/bold the Solution Lengths?
We thank the reviewer for taking the time to review our paper. In the following, we provided clarifications and our responses to the reviewer's concerns.
Response to Essential References Not Discussed
Although, according to ICML reviewer guidelines, authors cannot be expected to discuss papers that were made public less than four months before submission (arXiv submission: 4th of November, NeurIPS public announcement: 6th of November), we want to assure the reviewer that the suggested work reduces the number of steps in CoT reasoning by skipping steps. Our work, on the other hand, is not about skipping steps but about eliminating unnecessary exploration of actions. Therefore, that paper does not have much in common with ours. The example in our paper's Figure 1 shows deliberate exploration of promising states rather than exploration for its own sake, a phenomenon we explore further in Section 3.2. However, we will include these comparisons in the camera-ready version of our paper.
Response to Weaknesses
- Since we have added a comprehensive new set of results at the request of other reviewers, we will shorten the introduction so that the relevant new results can be included in the main paper rather than only in the appendix of the camera-ready version.
- We agree with the reviewer; we will update such usage to prevent confusion regarding the "reward" and "value" terms.
Response to Questions
- Figure 2 is a histogram, where the Y-axis is the number of correct instances. We will update Section 3.2 in the camera-ready version to prevent confusion.
Conclusion
We will also fix the typos mentioned in the camera-ready version. We thank you in advance for checking our rebuttal and reconsidering your score if you believe our explanations and the introduction of new results alleviated your concerns. If you have further questions, as far as we understand from the new ICML guidelines, you can only respond one more time (review → rebuttal → reviewer reply → author reply). Therefore, we kindly request you to be precise about your remaining concerns in your response. Also, please be sure to check our responses to other reviewers, where we have introduced new results utilizing fewer training examples and applying RL without the SFT stage to show the importance of AoT-O3, even in cases where we do not have the SFT stage.