To Backtrack or Not to Backtrack: When Sequential Search Limits Model Reasoning
A comparative study of backtracking versus parallel search in LLMs, revealing nuanced trade-offs: backtracking can harm reasoning due to constrained search strategies and verbosity, but is particularly suitable for RL.
Abstract
Reviews and Discussion
This paper systematically investigates the effectiveness of backtracking (in the context of test-time compute) when compared to parallel sampling for LLMs in reasoning tasks. Through experiments targeting CountDown and Sudoku, the paper finds that backtracking is not “universally beneficial”. Additionally, they find that training on backtracking examples can hinder model performance, that RL finetuning helps models in backtracking, and that aspects of the tasks (tree depth) impact which approach is best. Overall, the paper includes extensive experiments, careful articulation of the methods and clear and interesting results. I recommend that it be accepted.
Reasons to Accept
The paper is well written, its problem is clear, the insights are interesting, and the experiments are thorough. I believe that it will lead to interesting future work.
Reasons to Reject
I don’t have any reasons to recommend rejecting the paper.
Questions for the Authors
- I found the results in Appendix D particularly interesting. It is currently framed around what makes backtracking better, but I wonder if, instead, the issues are around parallel sampling. In problems with large depth, it may be hard for parallel sampling to yield sufficient variety and coverage. I also wonder if this could be shown by varying the number of filled cells in Sudoku and seeing how that impacts the performance in a more graded fashion (and similarly for CountDown).
- I’m struck by the rate of unsolved CountDown boards (noted on line 122). Why is this the case? If the model is approximating DFS, will this make CountDown harder for models (even though you’ve sampled the ones that were solvable, it could be that the approximation is still more prone to being incorrect).
Thank you reviewer EL6J for your constructive feedback! We are grateful that you found our work to include extensive experiments, careful articulation of the methods, and clear and interesting results that will lead to interesting future work.
We address your concerns below:
In problems with large depth, it may be hard for parallel sampling to yield sufficient variety and coverage. I also wonder if this could be shown by varying the number of filled cells in Sudoku and seeing how that impacts the performance in a more graded fashion (and similarly for CountDown).
- Great suggestion! We address this in our analysis by sorting Sudoku problems by search tree depth and evaluating the effectiveness of parallel sampling at each depth level (i.e., number of unfilled cells). Implementing what you suggested, we observe in this figure that the performance of parallel sampling degrades as problem depth increases. As a reference baseline, backtracking (i.e., sequential search) performance remains stable even as depth increases. This supports your intuition that deeper problems are harder to solve via parallel sampling.
- We would have liked to conduct the same analysis for Countdown, but in our current setup, each Countdown game uses only 4 candidates, resulting in a fixed tree depth of 3. Extending this analysis would require retraining models on new Countdown variants with varying depth. We will consider incorporating this additional CountDown analysis in the camera-ready version! That said, the “deep Countdown” experiment we include (Appendix D) already demonstrates the same trend—highlighting tree depth as a key factor in the relative success of backtracking strategies.
I’m struck by the rate of unsolved CountDown boards (noted on line 122). Why is this the case? If the model is approximating DFS, will this make CountDown harder for models (even though you’ve sampled the ones that were solvable, it could be that the approximation is still more prone to being incorrect)
- We were also struck by how difficult Countdown is to solve, and we verified both the correctness of our implementation and the quality of the search strategy used to generate training traces. The DFS-based solver achieves only a 60% success rate because it uses a sum-based heuristic to rank node expansions and prune the search tree. While effective in many cases, this heuristic can occasionally prune the correct path, especially on harder boards.
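  For concreteness, here is a minimal sketch of what such a heuristic-pruned DFS for Countdown might look like. The scoring rule (distance of the running sum from the target) and the pruning width `K` are illustrative assumptions, not the exact solver used to generate our traces.

  ```python
  # Illustrative heuristic-pruned DFS for Countdown (not the paper's exact solver).
  from itertools import combinations

  K = 5  # assumed pruning width: only the K most promising children are expanded

  def dfs(nums, target, trace=()):
      if len(nums) == 1:
          return list(trace) if nums[0] == target else None
      children = []
      for (i, x), (j, y) in combinations(enumerate(nums), 2):
          a, b = max(x, y), min(x, y)
          rest = [v for k, v in enumerate(nums) if k not in (i, j)]
          ops = [("+", a + b), ("-", a - b), ("*", a * b)]
          if b != 0 and a % b == 0:
              ops.append(("/", a // b))
          for op, val in ops:
              children.append((rest + [val], f"{a}{op}{b}={val}"))
      # Sum-based heuristic: prefer states whose total is closest to the target.
      children.sort(key=lambda c: abs(sum(c[0]) - target))
      for child_nums, step in children[:K]:  # pruning can discard the correct path
          result = dfs(child_nums, target, trace + (step,))
          if result is not None:
              return result
      return None

  print(dfs([3, 9, 11, 25], 59))  # list of steps, or None if the heuristic prunes the solution away
  ```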
- We explored alternative strategies—including different heuristics (e.g., multiplicative factor–based) and BFS—and selected the most effective among them (see Fig. 8). While we cannot rule out that a better search strategy for CountDown exists, we believe this situation reflects a broader reality in real-world reasoning tasks: crafting the optimal strategy is often non-trivial and rarely provably optimal. In practice, the best we can do is often to generate “reasonably” good solution traces.
- We also shared your concern about whether approximating DFS might make Countdown harder. In fact, in Section 4.3 (Fig. 4b), we address this concern by training on a mix of strategies designed to diversify the solution traces. While this mix-strategy model performs better than the pure DFS-trained model, it still underperforms the direct solution model—suggesting that the difficulty is not just due to approximating DFS, but also stems from intrinsic challenges in the task.
- Finally, we emphasize that the use of DFS allows for a sharp contrast between tasks: while DFS traces are less effective for Countdown, they enable superior performance in Sudoku, further highlighting the role of task structure in determining the utility of backtracking.
Thank you for your responses to my questions. The finding that depth level impacts performance with parallel sampling is interesting, and I appreciate the authors reporting on this experiment. I continue to think this paper should be accepted.
Thank you for your positive feedback! Given that your review indicates a strong appreciation of the paper’s contributions, we hope you might consider raising the confidence score, as you see fit. This would help ensure your evaluation is fully weighted in the review process.
Thank you for pointing this out. I've raised my confidence score a bit after reading your responses to mine and other reviews.
Thank you!
The paper investigates whether training large language models (LLMs) to perform backtracking offers consistent advantages over parallel sampling methods, especially under fixed test-time compute budgets. Using two strategic games, CountDown and Sudoku, as controlled testbeds, the authors compare models trained on explicit backtracking traces to those trained on correct final solutions. Results show that backtracking models outperform direct solution models in Sudoku but underperform in CountDown. The study attributes this divergence to two key issues: (1) backtracking traces impose rigid, potentially suboptimal strategies; (2) verbosity induced by trace-based training reduces reasoning efficiency. Furthermore, reinforcement learning (via GRPO) helps backtracking models discover new, effective strategies and improves test-time performance, while direct solution models show less gain and even reduced diversity after RL. The paper concludes that backtracking is not universally beneficial and must be judiciously applied based on task structure and model constraints.
Reasons to Accept
- The use of CountDown and Sudoku as diagnostic tasks is methodologically sound, providing a controlled environment to isolate the effects of backtracking and parallel strategies on LLM performance.
- The paper exposes important limitations of backtracking that are often overlooked in literature emphasizing step-by-step reasoning.
- The study extends its scope by integrating reinforcement learning (GRPO), revealing how backtracking-trained models benefit more from RL fine-tuning than direct solution models.
Reasons to Reject
- My biggest concern with the paper is the narrow selection of the two tasks. Whether backtracking helps or not seems to be very task-dependent. For example, CountDown likely has a broad, shallow search space with many possible solutions that are not too far apart. Parallel sampling works well here because generating multiple independent solutions increases the chance of hitting a correct one quickly. Sudoku, by contrast, has a deep, narrow solution space where most paths lead to dead ends, and intermediate steps must be precisely correct. Sequential search with backtracking excels here because correcting errors mid-way (by backtracking) allows recovery from partial failures. From that point of view, it can be expected that direct sampling is more effective for CountDown, while Sudoku benefits more from backtracking. It might be beneficial to consider other types of reasoning tasks beyond these two to show that the phenomenon is more general instead of being task-specific.
- The reliance on DFS-based backtracking traces for training raises concerns. The model's underperformance may stem more from the quality of traces than from backtracking as a strategy per se.
- The analysis on scaling effects is informative but somewhat shallow. For example, while larger backtracking models plateau in CountDown, the paper doesn’t deeply explore why scale fails to help in overcoming trace-induced biases.
Besides this, I feel this paper is still very interesting. I will raise the score if the authors could consider adding more tasks beyond these two tasks.
Questions for the Authors
Is it possible to consider other reasoning tasks beyond Countdown and Sudoku?
Thank you reviewer VtTK for your constructive feedback! We are grateful that you found our work to expose important limitations of backtracking that are often overlooked in the literature. We also found your suggestion of making the results more general particularly helpful.
We address your concerns below:
It might be beneficial to consider other types of reasoning tasks beyond these two to show that the phenomenon is more general.
- We agree that making our claims more general will strengthen our paper! In combination with comments from reviewer z9oL, we are currently extending our findings to more realistic LLM settings. We are actively working on reproducing our study in math problem-solving tasks. In this setup, we start from a model without backtracking capabilities and, mirroring our toy setting, train both a backtracking model and a direct solution model. We then evaluate their performance under equivalent test-time compute to assess whether our conclusions hold in a more realistic domain.
- We are also thinking about an additional synthetic task (details in the last response item)!
The reliance on DFS-based backtracking traces for training raises concerns. The model's underperformance may stem more from the quality of traces than from backtracking as a strategy per se.
- We agree that the backtracking model’s performance depends in part on the quality of its training traces. However, for many reasoning tasks—even relatively simple ones like Countdown—it is non-trivial to generate optimal solution traces without significant human effort. In practice, many real-world reasoning tasks face the same challenge: high-quality traces are difficult to define and scale.
- In Section 4.3 (Fig. 4b), we explicitly explore this concern through a mix-strategy variant for Countdown. We first carefully analyzed different search strategies (see Fig. 8), and then trained the model on a mix of these strategies to improve training data diversity. Despite these targeted efforts to improve trace quality, the backtracking model still underperforms the direct solution model. This suggests that the limitation is not solely due to trace quality, but reflects deeper, task-specific constraints on backtracking as a viable strategy. Importantly, in the case of Sudoku, models trained on DFS traces outperform the direct solution model, further supporting that the nature of the task plays a critical role.
The analysis on scaling effects is informative but somewhat shallow. For example, while larger backtracking models plateau in CountDown, the paper doesn’t deeply explore why scale fails to help in overcoming trace-induced biases.
- We agree that model scale alone does not overcome trace-induced biases—and this is expected under a supervised learning objective. Since the model is explicitly trained to imitate DFS-generated traces, there is no optimization signal encouraging it to deviate from the search strategy used to generate the training data. This reflects a limitation of SFT, not of scale itself.
- For Countdown, the model scales we work with (specifically, the larger ones) are sufficient to learn and reproduce the DFS strategy almost perfectly—effectively modeling the data generation process. Once this capacity threshold is reached, increasing scale provides no additional benefit, which explains the observed plateau. We will clarify this more explicitly in the camera-ready version.
I will raise the score if the authors could consider adding more tasks beyond these two tasks.
As a first step, we are extending our results to math tasks using general-purpose reasoning LLMs and expect to report these results in the coming days.
We focused on Countdown and Sudoku because they represent two extremes in reasoning complexity: Countdown is broad and shallow, while Sudoku is narrow and deep. We fully agree that exploring tasks that lie between these extremes would provide further insight.
- Deep-narrow Countdown (Appendix D): We have experimented with a variant of Countdown that increases search depth and constrains intermediate steps, pushing it toward a deeper, narrower structure.
- N-Queens problem: A classic CSP that sits between Countdown and Sudoku in both depth and complexity, offering a more balanced test case for backtracking. Game description: The goal is to place N queens on an N × N chessboard such that no two queens threaten each other. Solving the problem requires placing one queen per row, making it a sequential decision process with depth N. At each step, the model must choose a valid column position that avoids conflicts, resulting in moderate branching. This positions N-Queens between Countdown and Sudoku in terms of complexity—neither as shallow and wide as Countdown nor as deep and narrow as Sudoku.
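To make the search structure concrete, here is a standard textbook backtracking solver for N-Queens (a generic illustrative sketch, not code from the paper): one queen is placed per row, giving depth N, and a conflict at any row triggers backtracking to an earlier choice.

```python
# Generic N-Queens backtracking sketch (illustrative, not from the paper).
def solve_n_queens(n, cols=()):
    row = len(cols)                   # next row to fill; also the search depth
    if row == n:
        return list(cols)             # all rows filled: valid placement found
    for col in range(n):
        safe = all(col != c and abs(col - c) != row - r  # column/diagonal clash
                   for r, c in enumerate(cols))
        if safe:
            solution = solve_n_queens(n, cols + (col,))
            if solution is not None:
                return solution       # propagate the first complete solution
    return None                       # dead end: backtrack to the previous row

print(solve_n_queens(8))  # first solution found, e.g. [0, 4, 7, 5, 2, 6, 1, 3]
```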
While our immediate priority is adding math benchmarks, we aim to include N-Queens for the camera-ready version.
Thanks for considering adding more tasks! I just commented here so that you will still be able to respond after the rebuttal period.
Hello Reviewer VtTK! Thank you for engaging in the discussion and we apologize for the wait. We have added an additional task -- extending our results to real LLMs on math tasks! See details below:
- To address your concern on whether our conclusions from two games generalize, we have conducted additional experiments to show that when equating test-time compute budget, the effectiveness of backtracking depends on the nature of the task, and our conclusions generalize to general reasoning LLMs on math tasks. The main result figure can be found here and the full experiment details can be found here. We will include a more comprehensive study based on this result in the camera ready version!
We will include this additional result in the camera ready version and we are still considering adding an additional synthetic setting (N-Queens) as we have mentioned to make our results more robust!
Hello reviewer VtTK, we just wanted to check in to see if you had any thoughts on our updates. We’d be grateful if you could take a look and let us know if it addresses your concerns—your feedback would be very helpful!
The paper focuses on comparing backtracking vs direct solution in test-time computation. The paper proposes two datasets/tasks, CountDown and Sudoku. Interestingly, sequential search underperforms parallel sampling on CountDown but outperforms it on Sudoku. The paper then extends the analysis to reinforcement learning, more specifically the GRPO algorithm, and finds that teaching backtracking universally improves LLM reasoning.
Reasons to Accept
A very neat idea with clear demonstration on both SFT and RL for comparing backtracking and non-backtracking. The paper points out some interesting and fundamental research problems in test-time scaling and, even more broadly, in search strategies for finding the answer in a large space. I like the easy-to-follow tasks.
The paper puts some effort into training and comparing algorithms, rather than relying on simple prompt engineering. It provides some concrete analysis and some insights.
Reasons to Reject
The task is very limited and it’s not necessarily about LLMs. I feel like it’s more about the transformer architecture. I would like to see some real tasks or real models on these tasks, to get more sense of how the takeaways of this paper could be transferred to LLM tasks and settings.
Clarification of algorithms and terms, complexity analysis, and data statistics would help. The presentation of results in the paper was awesome, but the methodology part was a little vague to me.
Questions for the Authors
What’s the search depth of these two problems? Is this the reason why they perform differently on two tasks? It would be good to show some statistics about tree breadth and depth, and more info about that.
I am a little confused about non-backtracking vs direct solution vs parallel sampling. Could you provide more info about what’s the parallel search algorithm here. I think some algorithm table would be useful to differentiate algorithms here.
Thank you reviewer z9oL for your insightful feedback! We are glad that we were able to point out some interesting and fundamental research problems in test-time scaling and, more broadly, in search strategies for finding the answer in a large space. We found your concern about generalizing our findings to real LLMs particularly helpful.
We address your concerns below:
I would like to see some real tasks or real models on these tasks, to get more sense on how the takeaways of this paper could be transferred to LLM tasks and settings.
- We agree that extending our findings to more realistic LLM settings is an important direction. We are actively working on reproducing our study in math problem-solving tasks. In this setup, we start from a model without backtracking capabilities and, as in our toy setting, train both a backtracking model and a direct solution model. We then evaluate their performance under equivalent test-time compute to assess whether our conclusions hold in a more realistic domain.
The methodology part was a little vague to me.
- We will improve the clarity of Section 3 for the camera-ready version. Is there any part (definition, procedure, set-up) that you found particularly confusing?
What’s the search depth of these two problems? Is this the reason why they perform differently on two tasks? It would be good to show some statistics about tree breadth and depth, and more info about that.
- Search tree statistics for CountDown: We play a version of CountDown with N=4 candidate numbers. At each step we reduce one number by performing a math operation (illustrated in Fig. 2a). Therefore, the tree depth is N-1 = 3. The width (branching factor) is 4 × C(N, 2) = 2N(N-1), where N represents the number of candidates left. Concretely, the widths at each depth are 24, 12, and 4. See Appendix B1 (Line 578) for a detailed analysis.
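As a quick sanity check (an illustrative snippet of ours, not from the paper), these widths follow from choosing an unordered pair of the N remaining candidates and one of 4 arithmetic operations:

```python
# Branching factor of the Countdown search tree at each depth: 4 * C(N, 2),
# where N is the number of candidate numbers still on the board.
from math import comb

for n in (4, 3, 2):
    print(n, 4 * comb(n, 2))  # prints 24, 12, 4
```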
- Search statistics for Sudoku: We play Sudoku boards that on average have 56 unfilled cells (or 81 - 56 = 25 filled cells). Thus the tree depth is on average 56. At each step, we pick the easiest cell to fill (the one with the fewest candidate numbers). The tree width is determined by the number of candidate numbers, and on average there are between 1 and 5 candidates per cell.
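The “easiest cell first” rule can be sketched as follows (our own simplified illustration, assuming the board is a 9x9 list of lists with 0 marking unfilled cells): the cell with the fewest legal candidates is expanded next, which keeps the branching factor between roughly 1 and 5.

```python
# Illustrative "fewest-candidates-first" cell selection for Sudoku (assumes a
# 9x9 board as a list of lists, with 0 marking unfilled cells).
def candidates(board, r, c):
    used = set(board[r])                                   # row
    used |= {board[i][c] for i in range(9)}                # column
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {board[i][j] for i in range(br, br + 3)        # 3x3 box
                         for j in range(bc, bc + 3)}
    return [v for v in range(1, 10) if v not in used]

def easiest_cell(board):
    unfilled = [(r, c) for r in range(9) for c in range(9) if board[r][c] == 0]
    if not unfilled:
        return None
    return min(unfilled, key=lambda rc: len(candidates(board, *rc)))
```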
- Comparing the two games: CountDown has wide but shallow search trees, while Sudoku has deep and narrow search trees. We will incorporate these details into the main text in the camera-ready version.
- The characteristics of the search tree are a key factor differentiating the two tasks. In Appendix D, we directly test this hypothesis by introducing a deeper variant of Countdown and a shallower variant of Sudoku. By constructing those game variants, we control the search tree characteristics and show that they directly impact the effectiveness of backtracking. In short, our experiments confirm both your intuition and ours.
- For Sudoku, we included an additional analysis in which we sort Sudoku problems by search tree depth and evaluate the effectiveness of parallel sampling at each depth level. In this figure, we observe that the performance of parallel sampling degrades as problem depth increases. As a reference baseline, backtracking (i.e., sequential search) performance remains stable even as depth increases. This supports your intuition that deeper problems are harder to solve via parallel sampling.
I am a little confused about non-backtracking vs direct solution vs parallel sampling. Could you provide more info about what’s the parallel search algorithm here. I think some algorithm tables would be useful to differentiate algorithms here.
- Thank you for pointing this out, and we apologize for the mix of terminology. To clarify: the non-backtracking model and the direct solution model refer to the same thing—we will standardize the terminology to “direct solution model” in the camera-ready version. This direct solution model uses parallel sampling at test time, specifically via a Best-of-N strategy. We agree that a table summarizing the different algorithms would improve clarity and will include one in the revision.
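For clarity, the compute-matched Best-of-N baseline can be sketched as follows (illustrative pseudocode; `generate` and `verify` are caller-supplied placeholders, not functions from our codebase):

```python
# Illustrative Best-of-N parallel sampling under a fixed test-time token budget.
# `generate(prompt, max_tokens)` and `verify(completion)` are placeholder
# callables supplied by the caller, not an API from the paper.
def best_of_n(generate, verify, prompt, n_samples, tokens_per_sample):
    # n_samples * tokens_per_sample is matched to the token budget of a single
    # long sequential (backtracking) trace for an equal-compute comparison.
    samples = [generate(prompt, tokens_per_sample) for _ in range(n_samples)]
    correct = [s for s in samples if verify(s)]
    return correct[0] if correct else samples[0]
```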
Response acknowledged. Thanks for making the figure and conducting analysis!
Hi reviewer z9oL, thank you for engaging in the conversation!
As promised, to address your first concern regarding real tasks and real models, we have conducted an analysis using general reasoning models on math tasks. Please see details below:
To address your concern on whether our synthetic setting can generalize to real models, we have conducted additional experiments to show that when equating test-time compute budget, the effectiveness of backtracking depends on the nature of the task, and our conclusions generalize to real settings. The main result figure can be found here and the full experiment details can be found here. We will include a more comprehensive study based on this result in the camera ready version!
Hello reviewer z9oL, we just wanted to check in to see if you had any thoughts on our updates. We’d be grateful if you could take a look and let us know if it addresses your concerns—your feedback would be very helpful!
They conduct an analysis of how backtracking inside the model's CoT (sequential test-time compute) impacts the performance of LLMs on Countdown and Sudoku, comparing against parallel sampling techniques. They find that on Countdown sequential search underperforms parallel sampling, but on Sudoku it outperforms, suggesting that which is better depends largely on the task at hand. However, when training with RL, models equipped with backtracking learn much more effectively. They also point out two potential limitations of backtracking: 1) token inefficiency; and 2) suboptimal data.
Reasons to Accept
The analysis in their paper is well executed. They carefully consider different aspects that can affect backtracking performance, including task, model size, and data composition, and they do a good job of teasing each of these attributes apart in their experiments. I think this line of work of trying to better understand the scaling trends of different approaches to test-time scaling is really important and contributes greatly to our progress as a field.
Reasons to Reject
- Their experiments focus mostly on the SFT setting where the model is trained on a bunch of synthetically generated backtracking trajectories. This differs substantially from how general reasoning LLMs like DeepSeek-R1-Zero were trained. That model instead learns most of its backtracking behaviors from the RL phase, as there is no SFT phase at all for this model. As a result, it is possible that the findings do not transfer to these more general models.
- For each task, they generate a dataset for SFT synthetically. It is likely that some of their conclusions are specific to how the data is generated in each case. They conduct some ablations of this in 4.3, and their results with the mix-strategy model suggest that indeed the data is important here. As a result, their conclusions may be more specific to the data and less so to the task itself, as they claim.
- In a few places I felt that the writing was overly verbose, such as in Sections 3.1 and 3.2.
Thank you reviewer GNL9 for your comments and feedback! We are very glad that you found our work really important and that it contributes greatly to our progress as a field.
We address your concerns below:
Their experiments focus mostly on the SFT setting where the model is trained on a bunch of synthetically generated backtracking trajectories. This differs substantially from how general reasoning LLMs like Deepseek-R1 zero were trained.
- Many papers claim that backtracking is key for enhancing a model's ability for math problem solving [1, 2, 3]. However, many recent follow-up studies [4, 5, 6, 7] show that the only way to teach backtracking behaviors is through pretraining and SFT. In other words, if the backtracking behavior is not present in the base model pre-RL, it cannot emerge through RL. While DeepSeek-R1 zero claims anecdotal evidence regarding the importance of backtracking, we do not know what was in the pretraining data for these models and it is very likely that pretraining data already contains synthetically generated “backtracking” traces.
- The focus of this work is to understand the impact of backtracking on solving reasoning tasks in a highly controlled setting. To that end, our training setup mirrors the typical LLM pipeline: a supervised phase followed by reinforcement learning. Specifically, we first train models with supervised learning to solve Countdown and Sudoku with training data containing solution traces (Section 4). This step is analogous to supervised pretraining in LLMs. We then apply RL to these pretrained models (Section 5), reflecting the post-training RL phase used in models like DeepSeek-R1. While our domain and scale are simplified, the training setup is intentionally aligned with that of general reasoning LLMs.
This model instead learns most of their backtracking behaviors from the RL phase, as there is no SFT phase at all for this model. As a result, it is possible that the findings do not transfer to these more general models.
- In general, it is not very clear how to define backtracking for general tasks. First, the anecdotal DeepSeek-R1 backtracking traces [1, 2, 3] mostly consist of adding a “wait” token and restarting from the beginning of a reasoning step, whereas real backtracking consists of identifying the wrong step and correcting it. In addition, we do not know how many traces with backtracking behaviors are already present in their pretraining data.
- Using the synthetic tasks CountDown and Sudoku allows us to rigorously define backtracking behaviors and study them in a controlled setting. We are the first to study the efficacy of backtracking in the context of test-time compute scaling.
- To ensure our work translates to general reasoning LLMs, we are working on finetuning LLMs on math tasks (MATH/GSM8K) into a backtracking and a direct-solution model. We will then repeat the same analysis as we have done for CountDown and Sudoku. We will report results in the coming days!
They conduct some ablations of this in 4.3, and their results with the mix strategy model suggest that indeed the data is important here.
- We agree that data is very important and that a choice of suboptimal data can critically degrade performance. However, we argue that with “reasonable” data for either the backtracking or the direct solution strategy, the nature of the task determines performance. We provide two types of evidence for why the task is important:
- Evidence 1: For both Sudoku and Countdown, we use the same data generation strategy—DFS-based backtracking traces—as detailed in Sections 3.1.2 and 3.2.2. Despite this controlled setup, the two tasks yield different outcomes, suggesting that task structure plays a critical role in determining the effectiveness of backtracking. While Section 4.3 shows that mixing strategies can improve performance, the backtracking model still underperforms compared to the direct solution model. This supports our claim that while data matters, the task itself remains a key factor.
- Evidence 2: As additional evidence, Appendix D introduces variations of both tasks: a version of Sudoku with decreased search tree depth and a version of Countdown with increased depth. By manipulating tree depth alone, we directly control the effectiveness of backtracking. These results further support that the observed behaviors are not solely due to data construction, but are fundamentally linked to the structure of the task.
In a few places I felt that the writing was overly verbose, such as in Sections 3.1 and 3.2.
- Thank you for the feedback! We agree that for readers who are experts in LLM reasoning, the background in Sections 3.1 and 3.2 may feel overly detailed. Our goal was to ensure accessibility for readers less familiar with Countdown and Sudoku. That said, we will revise these sections in the camera-ready version to make the writing more succinct while retaining the necessary context.
[1] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://arxiv.org/abs/2501.12948
[2] SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild. https://arxiv.org/abs/2503.18892
[3] s1: Simple test-time scaling. https://arxiv.org/abs/2501.19393
[4] Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs. https://arxiv.org/abs/2503.01307
[5] Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining. https://arxiv.org/abs/2504.07912
[6] Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? https://arxiv.org/abs/2504.13837
[7] Decomposing Elements of Problem Solving: What "Math" Does RL Teach? https://arxiv.org/abs/2505.22756
Thank you for reviewing our paper! We just wanted to check in to see if you had any thoughts on our rebuttal. We’d be grateful if you could take a look and let us know if it addresses your concerns—your feedback would be very helpful!
Thank you for the rebuttal. I appreciate taking the time to respond to my concerns. A couple followup questions:
Re: "Many papers claim that backtracking is key for enhancing a model's ability for math problem solving [1, 2, 3]. However, many recent follow-up studies [4, 5, 6, 7] show that the only way to teach backtracking behaviors is through pretraining and SFT. In other words, if the backtracking behavior is not present in the base model pre-RL, it cannot emerge through RL. While DeepSeek-R1 zero claims anecdotal evidence regarding the importance of backtracking, we do not know what was in the pretraining data for these models and it is very likely that pretraining data already contains synthetically generated “backtracking” traces."
I don't think this is a strong enough argument to refute the point I was making. While there may be some examples of backtracking in pretraining that initialize the capability, after RL the model may learn to apply backtracking in a somewhat different fashion than how it was applied in the raw data itself. This is, after all, a point that is somewhat reaffirmed by your own experiments in 6.2. As a result, I don't think the conclusions drawn from the SFT setting can clearly generalize to how modern reasoning LLMs behave.
Re: "The focus of this work is to understand the impact of backtracking on solving reasoning tasks in a highly controlled setting. To that end, our training setup mirrors the typical LLM pipeline: a supervised phase followed by reinforcement learning. Specifically, we first train models with supervised learning to solve Countdown and Sudoku with training data containing solution traces (Section 4). This step is analogous to supervised pretraining in LLMs. We then apply RL to these pretrained models (Section 5), reflecting the post-training RL phase used in models like DeepSeek-R1. While our domain and scale are simplified, the training setup is intentionally aligned with that of general reasoning LLMs."
Yes this is the typical pipeline, but it is not clear that this pipeline matches how modern LLMs learn backtracking specifically. Deepseek R1's paper showed that the backtracking can emerge without SFT. It is unclear whether SFT on backtracking traces is necessary or matches how backtracking is/will be learned in modern LLMs.
Re: "Using synthetic tasks CountDown and Sudoku allows us to first rigorously define backtracking behaviors, and study in a controlled setting. We are the first to study the efficacy of backtracking in the context of test-time compute scaling."
Yes, I agree that these tasks make for an easy testbed for the ideas at hand, and so I don't think the paper should be strictly penalized for the toy nature of the experiments here.
Re: "We agree that data is super important and a choice of suboptimal data can critically degrade the performance. However, we argue that with “reasonable” data with backtracking or direct solution strategy, the nature of tasks determines the performance. We provide two types of evidences on why tasks is important:"
Ok yes, I am willing to accept that there is some task dependence here that goes beyond just the data distribution.
Overall, I appreciate the response and am willing to raise my score a point. I think it is still unclear to me whether the approach of first SFT'ing on raw backtracking traces before RL is the right way to train a reasoning LLM. I cannot strictly fault the authors for this, since I think this is largely an unanswered question in the published literature. However, my reservations somewhat remain because this work does seem to draw its conclusions assuming that models should be SFT'd on backtracking before applying RL.
Thank you GNL9 for your thoughtful response! We fully agree that whether backtracking emerges from SFT or RL remains an open question, and it is an important question for the published literature to study.
To address your concern on whether our synthetic setting can generalize to real models, at least in the setting where backtracking emerges from SFT (a case well-studied by [3]), we have conducted additional experiments to show that when equating test-time compute budget, the effectiveness of backtracking depends on the nature of the task, and our conclusions generalize to real settings. The main result figure can be found here and the full experiment details can be found here. We will include a more comprehensive study based on this result in the camera ready version!
This paper studies whether backtracking data is required for reasoning models. Reviewers all agree that the paper introduces findings for the reasoning community.