Let Me Think! A Long Chain of Thought Can Be Worth Exponentially Many Short Ones

审稿意见

评分: 5置信度: 42025-06-11

This paper investigates how to optimally allocate computational resources during reasoning tasks. Through both theoretical analysis and empirical experiments, the authors demonstrate that sequential scaling (e.g., using longer Chain-of-Thought, or CoT) can offer exponential advantages over parallel scaling (e.g., majority voting or best-of-N strategies) in certain graph connectivity problems. The main contribution of the work is the introduction of a reasoning task where slightly reducing the sequential budget requires a significant increase in the parallel budget to maintain the same level of accuracy. The authors also explore the behavior of transformers trained with reinforcement learning (RL) on this task, observing that the length of the generated CoT increases gradually during training.

优缺点分析

Strengths

The paper not only presents a theoretical analysis of the differences between sequential and parallel scaling in reasoning tasks but also validates these predictions through experiments.
The authors design a graph connectivity-based reasoning task that is both challenging and capable of simulating multi-step reasoning, offering an ideal testbed for studying the relative advantages of sequential scaling.

Weaknesses

This paper mainly focuses on a graph connectivity task, which is relatively simple and may not reflect the complexity of real-world reasoning scenarios. I believe deeper analysis should be conducted on more complex reasoning domains such as mathematical problem-solving (e.g., Math, AIME) or code generation (e.g. LivecodeBench) to better validate the generalizability of the findings.
The claim that reinforcement learning leads to an increase in CoT length is questionable. In practical reasoning-enhancement training settings, such as those involving mathematics or coding, they often use supervised fine-tuning to extend the length of reasoning chains first, followed by RL to achieve distribution shifts. In this process, RL does not necessarily result in longer CoTs.
The vertex query model abstracts graph access via an oracle $N_G$ , limiting algorithms to querying only the neighbors of specific vertices. While useful for theoretical analysis, this abstraction may overlook the complex interactions and capabilities of models when processing graph-structured data.
The paper mainly introduced an observed phenomenon—that increasing the model’s generation length significantly boosts reasoning performance. However, similar conclusions have been reported in the previous paper. For instance, methods such as process reward modeling or MCTS for parallel scaling are being increasingly abandoned due to high noise, with recent trends focusing instead on extending the length of reasoning chains with RL.
In addition, I recommend that the paper should propose a new solution under specific scenarios, that is, how to balance sequential and parallel scaling under constrained inference resources. Addressing these points with concrete experiments would not only enhance the robustness of the findings but also increase the overall impact of the work.

问题

See Weaknesses

局限性

Yes

最终评判理由

The authors have addressed most of my concerns. So I decide to raise my score to 4.

格式问题

No

作者回复

2025-07-29

Thank you for the constructive feedback and thoughtful comments. We are glad that you found our graph connectivity task an ideal testbed for theoretical study of the tradeoffs between sequential and parallel scaling for reasoning models and validating it through controlled experiments. Below we address your concerns.

This paper mainly focuses on a graph connectivity task, which is relatively simple and may not reflect the complexity of real-world reasoning scenarios.

Our goal in this paper is to make a general and fundamental claim about tradeoffs between parallel and sequential scaling. Our understanding of these tradeoffs is at its infancy. Multiple papers in the literature have seemingly contradictory claims: some show large benefits from sequential scaling [1], while other papers claim that parallel scaling alone is sufficient [2].

Our goal is to add clarity to the literature, by providing a controllable and analyzable task in which sequential scaling (theoretically & empirically) cannot be efficiently replaced by parallel scaling. Thus, our results yield the general conclusion that the parallel scaling recipes of papers such as [2, 3] cannot work for all types of problems.

Furthermore, we propose the graph connectivity task because it is very natural and captures multi-step reasoning ability, which seems to be a key aspect of more complex reasoning tasks such as math problems. Reasoning on graphs has been considered in the literature as an ideal abstraction of complex reasoning tasks (see [4,5,6,7] for examples), which also isolates the reasoning ability from memorization. Motivated by that, we have designed the bridge graph connectivity task, to capture key aspects of multi-step reasoning.

To be clear, our claim is not that sequential scaling is always better than parallel scaling for all tasks. Rather, it is that there is a natural class of tasks where it is (empirically and theoretically) better. We are the first to formally show such a gap for transformers in a natural setting that is a building block for more advanced multi-step reasoning tasks. This is a conceptual contribution that we believe is helpful for thinking about more complex tasks. We will edit the introduction and abstract to be more clear on this point so as to avoid confusion.

I believe deeper analysis should be conducted on more complex reasoning domains such as mathematical problem-solving (e.g., Math, AIME) or code generation (e.g. LivecodeBench) to better validate the generalizability of the findings.

We have run some evaluations of s1 [1] on GPQA Diamond, and compared sequential and parallel scaling in that setting (we cannot include any plots due to the removal of the option to share a pdf). These experiments show sequential vs. parallel tradeoffs qualitatively similar to our experiments on the graph connectivity task. As argued above, we believe that these experiments are not core to our message and that the paper stands on its own without them. Nevertheless, we will include these experiments in the camera ready paper, since they support our claim that graph connectivity is a helpful benchmark to consider.

The paper mainly introduced an observed phenomenon—that increasing the model’s generation length significantly boosts reasoning performance.

It is indeed known that increasing sequential scale can improve performance, see e.g. [1]. However, this is not the point of our paper.

Instead, our paper studies the tradeoff between parallel and sequential scaling, which is much less understood. Some works argue that parallel scaling alone is enough to get very good performance [2, 3]. Our paper demonstrates that a tradeoff exists between parallel scaling and sequential scaling for a certain natural class of graph connectivity tasks. Thus, our paper pushes back against [2, 3], showcasing a simple setting where parallel scaling alone is not enough.

The claim that reinforcement learning leads to an increase in CoT length is questionable.

We agree that RL is not the only way to enable long CoT reasoning in practice and supervised fine-tuning first was the standard approach prior to the release of DeepSeek-R1 [8, 9]. However, DeepSeek-R1-Zero [10] and its replications [11, 12] showed that even scaling up RL on a base model without supervised fine-tuning can lead to increase in the model's CoT length and accuracy (see Figures 2 and 3 in [10], as well as Figure 1 and Section 4 in [11]). Our study gives one plausible explanation for this phenomena from an expressivity lens of view, demonstrating a clear example of how RL can improve model's accuracy by reinforcing long CoTs the model is expressive enough to compute.

The vertex query model abstracts graph access via an oracle, limiting algorithms to querying only the neighbors of specific vertices. While useful for theoretical analysis, this abstraction may overlook the complex interactions and capabilities of models when processing graph-structured data.

The vertex query model is an abstraction that we empirically certify works extremely well in our from-scratch training experiments (see Figure 4). If transformers had significantly more capabilities in this setting, one would expect them to be able to correctly learn the shortest path, but empirically the model can’t learn it non-trivially better than the vertex query model would suggest. Additionally, the results implied by the vertex query model hold well even with our LLM experiments. The results with the vertex query model are complementary to Theorem 1 (which does not use the vertex query model). Theorem 1 tells us that the separation qualitatively holds, and the Vertex Query model allows us to get a much more fine-grained result in the setting of graph connectivity (albeit with more assumptions).

The paper should propose a new solution under specific scenarios, that is, how to balance sequential and parallel scaling under constrained inference resources.

We have done a grid search over combinations of sequential and parallel scaling (see Figures 1 and 5), showing how different combinations perform. In practice, the optimal tradeoff depends on the specific implementations, hardware constraints, and the setting. To make a fair comparison, one would have to optimize both techniques separately. That being said, we would be happy to add wallclock time comparisons for our experiments to the camera ready version of the paper. Theoretically, sequential scaling should be quadratic (with transformer architecture) in the CoT budget, while parallel should be linear. Since we show an exponential gap, this means that sequential scaling will always be favored in the limit. We are identifying a natural setting where sequential scaling can not be efficiently replaced by parallel scaling, however different specific settings can have different tradeoffs, and in such settings the optimal balance would be different. We believe that future works can use this work as a foundation to investigate the optimal tradeoff for specific settings, constrained to specific computational resources.

Thank you again for your careful reading of the paper and thoughtful suggestions. We hope that we have addressed your questions sufficiently.

[1] Muennighoff, Niklas, et al. "s1: Simple test-time scaling." arXiv preprint arXiv:2501.19393 (2025).

[2] Ma, Wenjie, et al. "Reasoning models can be effective without thinking." arXiv preprint arXiv:2504.09858 (2025).

[3] Brown, Bradley, et al. "Large language monkeys: Scaling inference compute with repeated sampling." arXiv preprint arXiv:2407.21787 (2024).

[4] Abbe, Emmanuel, et al. "How far can transformers reason? the globality barrier and inductive scratchpad." Advances in Neural Information Processing Systems 37 (2024): 27850-27895.

[5] Sanford, Clayton, et al. "Understanding transformer reasoning capabilities via graph algorithms." Advances in Neural Information Processing Systems 37 (2024): 78320-78370.

[6] Xu, Keyulu, et al. "What can neural networks reason about?." arXiv preprint arXiv:1905.13211 (2019).

[7] Kim, Juno, et al. "Metastable dynamics of chain-of-thought reasoning: Provable benefits of search, rl and distillation." arXiv preprint arXiv:2502.01694 (2025).

[8] "Demystifying Long Chain-of-Thought Reasoning in LLMs", Yeo et al. 2025

[9] "Kimi k1.5: Scaling Reinforcement Learning with LLMs", Kimi Team 2025

[10] "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", DeepSeek-AI 2025

[11] "SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild", Zeng et al. 2025

[12] "TinyZero", Pan et al. 2025

2025-08-06

We hope our review addressed your concerns to your satisfaction, sufficiently for you to consider raising your score! We are happy to hear any remaining concerns, confusions, or feedback about the paper.

2025-08-06

Thank you for your response. Considering that you do not conduct experiments on many common real-world reasoning scenarios (i.e. AIME 24, AIME 25), we think that this paper has limitations. So I decide to maintain my score.

2025-08-06

We have run some evaluations of s1 [1] on GPQA Diamond, and compared sequential and parallel scaling in that setting (we cannot include any plots due to the removal of the option to share a pdf). These experiments show sequential vs. parallel tradeoffs qualitatively similar to our experiments on the graph connectivity task. As argued above, we believe that these experiments are not core to our message and that the paper stands on its own without them. Nevertheless, we will include these experiments in the camera ready paper, since they support our claim that graph connectivity is a helpful benchmark to consider.

If you believe that AIME experiments are essential and would be sufficient to raise your score, we can run them in the next 2 days.

2025-08-08

In response to your valuable feedback, and to further assess the generalizability of our conclusions to more complex reasoning tasks, we conducted additional experiments on AIME-2024 and posted the results as an official comment. We hope these results address your remaining concerns and encourage you to consider raising your score.

2025-08-08

Considering that this article has too few experiments in real scenarios and has not yet run the AIME dataset, I decided to lower my score.

2025-08-08

Dear Reviewer 2f9M, Perhaps you did not see our "Official comment" since we posted it as a response to all reviewers. We have pasted its contents below here so that you can more easily see it. The experiments that we have run on AIME support our theoretical results.

--

In response to the reviewers’ feedback, we conducted additional experiments with the s1‑32B model [1] on AIME‑2024. For parallel scaling, we sample with temperature 1.0 and aggregate by majority vote over final answers. For sequential scaling, we limit the model’s thinking‑token budget and force a final answer once the limit is reached. In the ‘wait’ variant [1], we ignore the model’s first output and append the ‘wait’ token to induce further reasoning before the final answer. The experiments used ≈24 H200 GPU‑hours. Since PDF attachments are not supported here, we present the results in the table below.

Parallel (maj@k) / Sequential (token)	500	1k	2k	4k	8k	wait
1	0.067	0.133	0.200	0.333	0.300	0.433
2	0.067	0.133	0.200	0.333	0.300	0.433
4	0.100	0.133	0.233	0.433	0.433	0.500
8	0.067	0.133	0.300	0.433	0.467	0.533
16	0.067	0.167	0.333	0.433	0.500	0.567
32	0.067	0.133	0.400	0.433	0.533	0.600
64	0.067	0.133	0.400	0.433	0.533	0.567
Avg. #Thinking Tokens	500	1000	2000	3998	5092	5522

The results show that sequential scaling can not be efficiently replaced by parallel scaling for this mathematical task, supporting the generalizability of our findings to real-world scenarios. While quantifying the exact trade-off between them for complex mathematical problems such as this is beyond the scope of our fundamental study, we observe that the results confirm our conclusion that sequential scaling is necessary, and challenge claims from other works [2] that it can be entirely replaced by parallel scaling.

To further examine this observation, we break down the results by individual question (out of the 30 total). The table below reports the number of correct responses out of 64 attempts for each question.

Sequential (token budget) / Question ID	1	2	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28
500	0	0	1	0	0	36	12	0	0	0	13	0	0	0	0	2	3	0	0	0	0	0	0	0	0	0
1k	0	1	6	0	0	55	36	4	0	4	48	0	0	2	0	1	1	0	0	1	0	27	0	0	0	0
2k	56	1	20	0	0	63	23	55	0	29	47	0	1	9	1	1	4	12	0	0	12	34	57	0	15	0
4k	58	1	16	0	2	63	22	60	2	43	58	1	4	27	0	3	3	31	0	0	36	50	62	0	33	5
8k	62	2	5	2	24	64	28	63	1	49	58	9	16	40	2	2	16	33	2	0	29	51	63	2	36	8
wait	61	0	5	0	35	64	22	62	6	48	60	12	19	44	7	6	19	34	0	0	31	50	60	2	31	7

One notable takeaway is that there exist questions (e.g. 1, 10, 25) for which parallel scaling at low token budgets will never succeed, but a single shot with a higher token budget is very likely to succeed. This shows the necessity of sequential scaling for achieving the highest possible score on this task. We hope this further clarifies the generalizability of the insights offered by our work.

[1] Muennighoff, Niklas, et al. "s1: Simple test-time scaling." arXiv preprint arXiv:2501.19393 (2025).

[2] Ma, Wenjie, et al. "Reasoning models can be effective without thinking." arXiv preprint arXiv:2504.09858 (2025).

审稿意见

评分: 5置信度: 42025-06-26

This paper investigates the fundamental tradeoff between sequential scaling (longer chains of thought) and parallel scaling (multiple short chains with aggregation) in large language model reasoning. The authors demonstrate through both theoretical analysis and empirical validation that sequential scaling can offer exponential advantages over parallel scaling on graph connectivity problems. They introduce a "bridge graph" task based on graph connectivity and show that small decreases in sequential scale require large increases in parallel scale to maintain performance.

优缺点分析

Strengths and Contributions

Novel theoretical framework: The paper provides rigorous theoretical analysis using transformer expressivity limitations on graph reasoning tasks.
Sufficient empirical validation: Various models are assessed and consistently validate the theoretical predictions.
Practical insights: The work addresses a fundamental question in inference-time scaling that is highly relevant to current LLM development.

Limitations / Weaknesses

Limited scope of tasks: The results are demonstrated only on graph connectivity problems. While the authors acknowledge this limitation, it's unclear how broadly these findings generalize to other reasoning domains.
Noisy Empirical Results: In some plots of empirical results, including the right plot of Figure 1 and the two plots in Figure 5, the results are not visibly exponential. The curve type of the sequential-parallel tradeoff needs to be further determined.
Limited exploration of hybrid approaches: The paper focuses primarily on pure sequential vs. pure parallel scaling, with less investigation of optimal combinations of both approaches.

问题

Overall, this paper makes solid theoretical and empirical contributions to an important problem in current LLM research. However, the limited scope to graph connectivity problems and reliance on unproven complexity-theoretic assumptions prevent it from being a clear accept. The work provides valuable insights into inference-time scaling but needs broader validation to have significant impact. If the authors can adequately address the concerns raised in the Limitations/Weaknesses part, I would be willing to increase my score.

Additional Questions

Limited Empirical Experiment Scope: Though adapting the theoretical analysis could be tough, the authors can easily conduct empirical analysis on sequential scaling and parallel scaling for other LLM reasoning tasks. This will help to better validate the effectiveness of the authors claims in more diverse settings.

局限性

Yes

最终评判理由

The authors have addressed my concerns, so I am inclined to raise my score.

格式问题

None

作者回复

2025-07-31

Thank you for your thoughtful comments and helpful feedback! We are glad that you found our theoretical framework and its empirical validation novel and solid to address the fundamental tradeoffs in inference-time scaling and giving practical insights.

The results are demonstrated only on graph connectivity problems. While the authors acknowledge this limitation, it's unclear how broadly these findings generalize to other reasoning domains

Our goal in this paper is to make a general and fundamental claim about tradeoffs between parallel and sequential scaling. Our understanding of these tradeoffs is at its infancy. Multiple papers in the literature have seemingly contradictory claims: some show large benefits from sequential scaling [1], while other papers claim that parallel scaling alone is sufficient [2].

Our goal is to add clarity to the literature, by providing a controllable and analyzable task in which sequential scaling (theoretically & empirically) cannot be efficiently replaced by parallel scaling. Thus, our results yield the general conclusion that the parallel scaling recipes of papers such as [2, 3] cannot work for all types of problems.

Furthermore, we propose the graph connectivity task because it is very natural and captures multi-step reasoning ability, which seems to be a key aspect of more complex reasoning tasks such as math problems. Reasoning on graphs has been considered in the literature as an ideal abstraction of complex reasoning tasks (see [4,5,6,7] for examples), which also isolates the reasoning ability from memorization. Motivated by that, we have designed the bridge graph connectivity task, to capture key aspects of multi-step reasoning.

To be clear, our claim is not that sequential scaling is always better than parallel scaling for all tasks. Rather, it is that there is a natural class of tasks where it is (empirically and theoretically) better. We are the first to formally show such a gap for transformers in a natural setting that is a building block for more advanced multi-step reasoning tasks. This is a conceptual contribution that we believe is helpful for thinking about more complex tasks. We will edit the introduction and abstract to be more clear on this point so as to avoid confusion.

Finally, we have run some evaluations of s1 [1] on GPQA Diamond, and compared sequential and parallel scaling in that setting (we cannot include any plots due to the removal of the option to share a pdf). These experiments show sequential vs. parallel tradeoffs qualitatively similar to our experiments on the graph connectivity task. As argued above, we believe that these experiments are not core to our message and that the paper stands on its own without them. Nevertheless, we will include these experiments in the camera ready paper, since they support our claim that graph connectivity is a helpful benchmark to consider.

In some plots of empirical results, including the right plot of Figure 1 and the two plots in Figure 5, the results are not visibly exponential.

Our results only show that the tradeoff is at least exponential. In those plots, the contours of equal loss appear to possibly grow faster than exponential, so do not contradict our results. It is not too surprising that the LLM experiments don’t perfectly match the smaller scale ones, since LLMs are not trained on this specific task, so will spend tokens figuring out what solution they will try, and won’t make assumptions about the structure of the graph and therefore are likely to attempt suboptimal solutions. We will make sure to add more discussion and concrete examples of this observation to the paper.

The paper focuses primarily on pure sequential vs. pure parallel scaling, with less investigation of optimal combinations of both approaches.

We have done a grid search over combinations of sequential and parallel scaling (see Figures 1 and 5), showing how different combinations perform. In practice, the optimal tradeoff depends on the specific implementations, hardware constraints, and the setting. We believe that future works can use this work as a foundation to investigate the optimal combinations for specific settings.

Reliance on unproven complexity-theoretic assumptions.

The assumption that $\textrm{TC}^0 \neq \textrm{L}$ is a standard assumption (see [8,9]) commonly held by complexity theorists (similarly to the conjecture that $\textrm{P} \neq \textrm{NP}$ ). We explicitly mention this assumption whenever we use it, and also have theoretical results (see Theorem 1) that do not rely on this assumption.