PaperHub
Overall: 7.3/10
Decision: Poster · 4 reviewers
Ratings: 5, 5, 3, 5 (min 3, max 5, std 0.9)
Confidence: 3.5
Novelty: 3.0 · Quality: 3.3 · Clarity: 3.0 · Significance: 2.5
NeurIPS 2025

SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning

OpenReview · PDF
Submitted: 2025-04-19 · Updated: 2025-10-29

Abstract

Recent advances in inference-time compute have significantly improved performance on complex tasks by generating long chains of thought (CoTs) using Large Reasoning Models (LRMs). However, this improved accuracy comes at the cost of high inference latency due to the length of generated reasoning sequences and the autoregressive nature of decoding. Our key insight in tackling these overheads is that LRM inference, and the reasoning that it embeds, is highly tolerant of approximations: complex tasks are typically broken down into simpler steps, each of which brings utility based on the semantic insight it provides for downstream steps rather than the exact tokens it generates. Accordingly, we introduce SpecReason, a system that automatically accelerates LRM inference by using a lightweight model to (speculatively) carry out simpler intermediate reasoning steps and reserving the costly base model only to efficiently assess (and potentially correct) the speculated outputs. Importantly, SpecReason's focus on exploiting the semantic flexibility of thinking tokens in preserving final-answer accuracy is complementary to prior speculation techniques, most notably speculative decoding, which demands token-level equivalence at each step. Across a variety of cross-domain reasoning benchmarks, SpecReason achieves 1.4-3.0$\times$ speedup over vanilla LRM inference while improving accuracy by 0.4-9.0%. Compared to speculative decoding without SpecReason, their combination yields an additional 8.8-58.0% latency reduction. We open-source SpecReason at https://anonymous.4open.science/r/specreason/.
Keywords

Inference-Time Compute · Large Reasoning Models · Efficient Inference · Speculative Execution · Systems for Machine Learning

Reviews and Discussion

Review
Rating: 5
  • This paper proposes SpecReason, a system to accelerate inference for Large Reasoning Models (LRMs).
  • It uses a lightweight model to speculatively perform intermediate reasoning steps, invoking the base LRM only for verification and correction.
  • Unlike prior speculative decoding, SpecReason exploits the semantic flexibility of reasoning steps without requiring token-level equivalence.
  • On multiple reasoning benchmarks, it achieves 1.4×-3.0× speedup and 0.4%-9.0% accuracy improvement over standard LRM inference. Combined with speculative decoding, it delivers an additional 8.8%-58.0% latency reduction.

Strengths and Weaknesses

  • Strengths:

    (1) The paper makes an interesting observation in Section 3, identifying three key properties of LRM reasoning: intermediate steps are easier than end-to-end reasoning; reasoning primarily depends on semantic insight rather than exact token outputs; and occasional errors can be corrected through reflection. This analysis is insightful and provides strong motivation for the proposed method.

    (2) The paper is clearly written and well-structured, making the methodology and results easy to follow.

    (3) As shown in Figure 3, the experimental results effectively demonstrate the advantages of SpecReason. Moreover, the combination of SpecReason with speculative decoding achieves even better latency reductions, verifying the method’s compatibility with existing techniques.


  • Weaknesses:

    (1) As described in Lines 196–202, the method uses DeepSeek-R1 1.5B as the lightweight model to generate 70 new tokens (one reasoning step), and a larger model (QWQ-32B or DeepSeek-R1-70B) to score the step. Steps with a score ≥7 are accepted, while others are regenerated by the large model. This heavily relies on the small model’s acceptance rate — if it is low, the overall efficiency gain diminishes. Moreover, using the large model for step evaluation introduces potential additional latency. It would be valuable to provide guidelines on suitable model size ranges. Section A.1 addresses part of this concern, but it would be helpful to see corresponding SpecReason+Decoding results in that section as well.

    (2) Since many related speculative reasoning and decoding works have emerged recently, it would be helpful if the authors could include a brief discussion in the Appendix comparing their approach with concurrent methods.

    (3) No other major weaknesses identified.

Questions

  • In Line 200, the paper mentions generating “∼70 new tokens” per reasoning step. Is this number empirically determined from observations, or is it a fixed hyperparameter?

  • Have you considered or observed cases where the LRM produces repeated reasoning steps or revises earlier steps? Could SpecReason’s scoring or acceptance mechanism be extended to reuse previously generated steps to further improve efficiency?

  • Since it can be difficult to clearly delineate reasoning steps in many LRM outputs, how is step segmentation controlled in your implementation? Is it based on a fixed token length, semantic boundary detection, or other heuristics?

Limitations

See "Strengths And Weaknesses".

Formatting Issues

No issues.

Author Response

Thank you for your detailed feedback! We appreciate that the reviewer finds our insights interesting and valuable. Below, we respond to individual questions.

Question 1: Prefilling 70 tokens in each verification step

The ~70 tokens come from the concise yet effective base model prompt that scores the speculated reasoning step; this number reflects a design choice rather than an empirically tuned value. We will include the base model prompt in the final version of the paper. We note that the current implementation of SpecReason allows for extensive customization in future work, such as alternative concise yet effective scoring prompts.

Question 2: Have you considered or observed cases where the LRM produces repeated reasoning steps or revises earlier steps? Could SpecReason’s scoring or acceptance mechanism be extended to reuse previously generated steps to further improve efficiency?

Empirically, LRMs revising earlier steps is a widely observed phenomenon (e.g., DeepSeek-R1’s Aha moment [1] suggests that reflection behavior emerges during training; Sky-T1 [2] similarly observed redundant and excessive reasoning steps). Reusing scores from prior similar steps is indeed a viable way to reduce the prefilling time of the base model’s scoring prompts. However, we did not adopt this strategy because we observed that the end-to-end latency of SpecReason (and LRM inference in general) is dominated by the aggregated time-per-token (TPT) latency rather than the total prefilling time across verification rounds. Due to the sheer number of tokens required per request, score reuse would result in only marginal reductions in end-to-end latency.

Question 3: How are boundaries between steps defined

In our current implementation, we simply use each occurrence of a double newline character ('\n\n') as the divider between reasoning steps. SpecReason also supports other segmentation markers, such as ending a step after each period ('. ').
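
For concreteness, a minimal Python sketch of this segmentation rule; the function name and example trace are illustrative, not taken from the released implementation:

```python
# Split a reasoning trace into steps at each occurrence of a marker,
# '\n\n' by default, mirroring the segmentation rule described above.
def split_into_steps(reasoning_text: str, marker: str = "\n\n") -> list[str]:
    return [step for step in reasoning_text.split(marker) if step.strip()]

trace = "First, factor the polynomial.\n\nNext, check each candidate root.\n\nSo the answer is 3."
print(split_into_steps(trace))
# ['First, factor the polynomial.', 'Next, check each candidate root.', 'So the answer is 3.']
```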

Weakness 1: Reliance on the small model’s acceptance rate

Although we did not observe low acceptance rates across the diverse workloads and model combinations we evaluated, we acknowledge that the small model should be of a reasonable size to ensure it can generate a substantial portion of reasoning steps that preserve semantic meaning. In a real system deployment scenario, we foresee a runtime monitor (sketched below) that continuously tracks the acceptance ratio on a per-request basis. For requests that exhibit a consistently low acceptance rate, SpecReason would fall back to vanilla base model generation to avoid the latency overhead of drafting and verification.
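
A minimal sketch of such a monitor; the minimum rate and warm-up window are illustrative assumptions, neither is specified in the paper:

```python
# Hypothetical per-request acceptance monitor: after a warm-up window, if the
# fraction of accepted draft steps stays below `min_rate`, fall back to
# vanilla base-model decoding for the rest of the request.
class AcceptanceMonitor:
    def __init__(self, min_rate: float = 0.3, warmup_steps: int = 10):
        self.min_rate = min_rate
        self.warmup_steps = warmup_steps
        self.accepted = 0
        self.total = 0

    def record(self, accepted: bool) -> None:
        self.accepted += int(accepted)
        self.total += 1

    def should_fall_back(self) -> bool:
        if self.total < self.warmup_steps:
            return False  # not enough observations yet
        return self.accepted / self.total < self.min_rate
```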

Weakness 1: Latency overhead of base model verification

SpecReason introduces only marginal latency overhead thanks to its efficient design. As noted in our response to Question 2, each verification is a single prefill, which efficiently parallelizes token processing and results in latency equivalent to the TPT of only 1–2 tokens. In contrast, the end-to-end latency of SpecReason (and LRM inference in general) is dominated by the decoding time of the thinking tokens.
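
A back-of-the-envelope model of this argument; all parameter values below are made-up assumptions for illustration, not measurements from the paper:

```python
# Estimate per-request latency (seconds): accepted steps are decoded by the
# small model, rejected steps are additionally re-decoded by the base model,
# and every step pays one cheap verification prefill on the base model.
def estimated_latency(n_steps: int, tokens_per_step: int, accept_rate: float,
                      small_tpt: float, base_tpt: float,
                      verify_prefill: float) -> float:
    accepted = n_steps * accept_rate
    rejected = n_steps - accepted
    decode = (accepted * tokens_per_step * small_tpt
              + rejected * tokens_per_step * (small_tpt + base_tpt))
    verify = n_steps * verify_prefill
    return decode + verify

# 100 steps of ~70 tokens, 80% acceptance, 5 ms/token draft, 25 ms/token base,
# verification prefill costing about two base-model tokens (2 * 25 ms).
print(estimated_latency(100, 70, 0.8, 0.005, 0.025, 0.05))  # ~75 s
print(100 * 70 * 0.025)                                     # ~175 s vanilla
```

Under these assumed numbers, verification contributes only 5 of the 75 seconds, consistent with the claim that decoding of thinking tokens dominates.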

Weakness 2: Related work on speculative reasoning

We acknowledge that many papers have been published on arXiv since we submitted our paper in April, and we will include a brief discussion of these works in the final version of our paper. As a preview, the primary differences between SpecReason and related work are:

  1. Accuracy preservation: SpecReason maintains accuracy (compared to the base model) for state-of-the-art models and datasets, whereas other approaches often exhibit accuracy degradation for speedups (e.g., Speculative Thinking [3]).
  2. Training-free design: SpecReason is a training-free approach that works with off-the-shelf models, whereas related work (e.g., SplitReason [4]) requires resource- and time-intensive retraining or fine-tuning.

References

[1] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., ... & He, Y. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

[2] Griggs, T., Cao, S., Li, D., Liu, S., Patil, S. G., Zaharia, M., Gonzalez, J., & Stoica, I. (2025). Think Less, Achieve More: Cut Reasoning Costs by 50% Without Sacrificing Accuracy. Blog post.

[3] Yang, W., Yue, X., Chaudhary, V., & Han, X. (2025). Speculative thinking: Enhancing small-model reasoning with large model guidance at inference time. arXiv preprint arXiv:2504.12329.

[4] Akhauri, Y., Fei, A., Chang, C. C., AbouElhamayed, A. F., Li, Y., & Abdelfattah, M. S. (2025). Splitreason: Learning to offload reasoning. arXiv preprint arXiv:2504.16379.

评论

Thank you for the insightful review. Could you please check how well the authors' response addresses your questions/concerns, and share any feedback with the authors? This is valuable for the authors and for the decision-making review process.

Review
Rating: 5

The paper introduces SpecReason, a novel framework designed to accelerate inference in Large Reasoning Models (LRMs) by leveraging the inherent approximation tolerance of intermediate reasoning steps. Unlike conventional speculative decoding approaches that require token-level equivalence between draft and base models, SpecReason operates at the level of semantic similarity, enabling a lightweight model to speculatively generate intermediate reasoning tokens that are subsequently assessed by the base model for their utility in advancing the reasoning trajectory. The method exposes configurable mechanisms to balance accuracy and latency, such as adjustable acceptance thresholds and selective delegation of initial reasoning steps to the base model. Extensive experiments on diverse reasoning benchmarks demonstrate that SpecReason achieves substantial reductions in inference latency—ranging from 1.4× to 3.0× compared to vanilla LRM inference—while also yielding accuracy improvements of up to 9%. Furthermore, the authors show that SpecReason is complementary to speculative decoding, and their combination yields additional latency gains, highlighting the practical relevance of the proposed approach in scaling efficient reasoning-centric applications.

Strengths and Weaknesses

Strength

  1. SpecReason deals with the problem of high inference latency in LRMs by proposing an acceleration framework that does not compromise accuracy. The method is effective and readily transferable, demonstrating promising potential for practical deployment.

Weakness

  1. SpecReason relies heavily on the base model’s ability to accurately assess the utility of speculative steps. While preliminary evidence is provided (e.g., correlation with process reward models), this dependency could limit robustness if the base model’s evaluation quality degrades in more diverse or noisy settings.

Questions

See Weaknesses.

Limitations

See Weaknesses.

Final Justification

The authors addressed my concerns in rebuttal, so I maintain my original score.

Formatting Issues

No

Author Response

Thank you for your detailed feedback! We appreciate that the reviewer finds our insights interesting.

Weakness: SpecReason’s reliance on the base model’s assessment capability

We acknowledge that a robust scoring/assessment mechanism is indeed a requirement; however, we find that the latest reasoning models are capable of making accurate assessments (see our results in Fig. 3). We are also happy to include results demonstrating that not all models (e.g., smaller reasoning models) can consistently make accurate assessments. Furthermore, SpecReason does not hinge on the base LRM’s assessment capabilities: it generalizes to alternative approaches for evaluating the quality of speculated reasoning steps, such as a lightweight verification model (commonly used in other workloads like RL rollout scoring) or leveraging the confidence of tokens generated by the smaller draft model.

Comment

Thank you for the insightful review. Could you please check how well the authors' response addresses your questions/concerns, and share any feedback with the authors? This is valuable for the authors and for the decision-making review process.

Comment

I have read the rebuttal and will update the score accordingly.

Review
Rating: 3

Inference-time compute has emerged as a key factor for scaling AI capabilities, particularly in Large Reasoning Models (LRMs), which perform complex tasks by generating long CoTs. However, these models face significant latency issues. To address this latency without compromising accuracy, the paper introduces SpecReason, an approach that offloads simpler intermediate reasoning steps to smaller, faster speculative models. SpecReason leverages two critical insights: first, reasoning difficulty is heterogeneous, meaning many intermediate steps are relatively simple and can be effectively handled by lightweight models; second, reasoning effectiveness depends primarily on semantic insight rather than exact token fidelity, allowing for approximate yet meaningful step generation. SpecReason employs a lightweight model to speculate reasoning steps, which are subsequently evaluated by a more capable base model. This method significantly reduces latency by selectively accepting semantically adequate speculative steps.

Strengths and Weaknesses

Strengths:

  1. The paper is well-written and easy to follow.
  2. Achieving fast and accurate inference is important for both LLMs and LRMs.
  3. The extension of the "speculative" concept from decoding to reasoning is interesting and innovative.

Weaknesses:

  1. It is unclear whether this method applies exclusively to LRMs or could also be effectively employed in LLMs. The paper presents it as potentially general, yet does not explicitly explore its broader applicability.
  2. In line 36, the authors claim "without compromising accuracy." However, can this method guarantee 100% identical outputs compared to baseline methods? If not, this statement is imprecise. Unlike speculative decoding, which ensures lossless acceleration and perfect accuracy, this speculative reasoning approach may inherently introduce slight inaccuracies. The authors should clarify this distinction and avoid overstating the claim.
  3. Several relevant recent works [1, 2] addressing the "overthinking" problem [3] are omitted. Including these studies would strengthen the related work section and clearly situate the paper within existing research.
  4. In Section 3 Motivation, key assumptions such as "intermediate steps are easier" and "reasoning progress depends on insights" are presented without sufficient experimental evidence or external references. Additional experiments or citations are required to substantiate these claims effectively.
  5. The use of references formatted as "[...]" is somewhat unconventional.
  6. Consistent with point 3, the paper lacks direct comparisons with state-of-the-art methods [1, 2] addressing similar challenges. Including explicit comparative analyses would significantly bolster the validity and impact of the presented approach.

[1] O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning.

[2] L1: Controlling how long a reasoning model thinks with reinforcement learning.

[3] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Questions

Please check points 3, 4, and 6 in the Weaknesses regarding missing related work and comparisons, and support for the motivation.

Limitations

Yes.

Final Justification

The authors addressed part of my concerns; there is still no comparison with the current SOTA methods, so I keep my score.

Formatting Issues

The reference format with "[...]" is strange.

Author Response

Thank you for your detailed feedback! We appreciate that the reviewer finds our insights interesting and innovative. Below, we respond to individual questions.

Weaknesses 3 & 6: Related work on addressing the overthinking problem

We thank the reviewer for pointing us to the related work. A key distinction between SpecReason and recent work [1,2] is that SpecReason is a training-free approach that can be applied to any off-the-shelf model, whereas related work [1,2] requires fine-tuning or training LRMs, which is a time- and resource-intensive process. If time permits, we will add a baseline that runs vanilla inference using a reasoning model fine-tuned from one of the base models in our evaluation to reduce overthinking.

Weakness 4: Experimental evidence of our insights

Our first key insight, "intermediate steps are easier than end-to-end reasoning", is the cornerstone assumption of chain-of-thought reasoning [4]. Our third key insight, “LRMs can correct occasional mistakes via self-reflection”, is a widely observed phenomenon in LRMs (e.g., DeepSeek-R1’s Aha moment [5] suggests that reflection behavior emerges during training). We will add the corresponding references to related work.

Our second key insight, "reasoning progress depends on insights, not exact tokens", is our novel contribution. It is supported by empirical results showing that relaxing token-level equivalence to semantic equivalence or similarity still preserves quality, as demonstrated in our main results (Fig. 3). We will add additional explanations illustrating that token-level equivalence is indeed not achieved by the smaller reasoning model, yet this relaxation is sufficient from an accuracy perspective. For example, we will include example reasoning trajectories comparing vanilla inference with SpecReason in the final version of our paper.

Weakness 1: Extension to LLMs

SpecReason targets LRM inference, which is the driving force behind inference-time scaling, making our focus relevant to a wide range of practical workloads. While our method is tailored to LRMs, it naturally extends to LLMs – particularly in scenarios that elicit chain-of-thought (CoT) decoding [4]. This is because our core insights – that intermediate steps are easier, and that reasoning progress depends on insights – are rooted in the CoT paradigm, where complex tasks are broken down into simpler, sequential subproblems. We will include a discussion on the applicability of SpecReason’s techniques in the final paper.

Weakness 2: Accuracy compromise

We agree with the reviewer that, unlike speculative decoding, SpecReason does not guarantee token-by-token equivalence. This is by design: our key insight is that relaxing the constraint for exact token matches enables the use of alternative, potentially imprecise reasoning tokens – relative to the base model – while still achieving the same final output accuracy. This flexibility opens up new opportunities for latency reduction. We thank the reviewer for pointing this out and will revise the paper to clarify this distinction and explicitly note that SpecReason's accuracy-preserving property is demonstrated empirically rather than guaranteed theoretically.

References

[1] - [3]: See the reviewer's comment

[4] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., ... & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, 24824-24837.

[5] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., ... & He, Y. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

Comment

Thank you for the insightful review. Could you please check how well the authors' response addresses your questions/concerns, and share any feedback with the authors? This is valuable for the authors and for the decision-making review process.

Comment

The authors addressed part of my concerns; there is still no comparison with the current SOTA methods, so I keep my score.

Review
Rating: 5

The paper presents a technique called SpecReason for accelerating the inference of large reasoning models by using a comparatively smaller model as a draft model. The larger model verifies the small model’s proposed next steps by giving a discrete score from 0 to 9. The step is accepted or rejected based on a user-defined threshold. In the evaluation, SpecReason shows 10%-50% improvement in speed compared to Speculative Decoding in MATH, AIME, and GPQA benchmarks.
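
The loop the reviewer summarizes can be made concrete with a minimal sketch; `draft_generate`, `base_generate`, and `base_score` are placeholders for calls to the small and base models, not the paper's actual API, and the end-of-reasoning check is an assumption:

```python
ACCEPT_THRESHOLD = 7  # user-defined; steps scored >= 7 (0-9 scale) are kept

def spec_reason(prompt: str, max_steps: int,
                draft_generate, base_generate, base_score) -> str:
    """Speculate each reasoning step with the small model; keep it only if
    the base model scores it above the threshold, else regenerate it."""
    context = prompt
    for _ in range(max_steps):
        step = draft_generate(context)      # small model drafts one step
        score = base_score(context, step)   # base model rates it 0-9
        if score < ACCEPT_THRESHOLD:
            step = base_generate(context)   # rejected: base model redoes it
        context += step
        if "</think>" in step:              # assumed end-of-reasoning marker
            break
    return context
```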

Strengths and Weaknesses

Strengths:

  1. A simple and practical idea for a crucial problem of inference latency of large reasoning models.
  2. The evaluation results are consistent with 10-50% improvement over standard Math benchmarks.

Weaknesses:

  1. The evaluation is only focused on Math tasks and does not show the generality of the technique on other domains.
  2. The motivation section alludes to several insights about LRMs. However, these insights are stated without any empirical evidence. Can authors include empirical analysis on the confidence scoring patterns that assess the difficulty of each step in the reasoning trajectory?
  3. The base model is prompted to generate a score for the draft reasoning step. The authors should include an empirical analysis of using confidence based on the logits of the speculated step instead.
  4. Although the main experiments are performed with k=16 samples, the results do not report the error bars as stated in the Checklist.

Minor:

Line 50: “shown in Fig. 3, this opens the door to significantly faster inference without sacrificing output quality.” Can you avoid pointing to Fig. 3 here?

Section 4.1: Small pseudocode snippets may improve the readability of this section.

Section 4.2: This section repeats the background section's discussion of speculative decoding.

The evaluation uses the term “speedup” when comparing with vanilla LRM inference but “latency” when comparing with speculative decoding. I would recommend using a single terminology for consistency.

Questions

How can this technique be generalized to domains beyond mathematics?

Limitations

Yes

Final Justification

The authors have addressed my concerns and promised to include experiments that would improve the paper.

Formatting Issues

No

Author Response

Thank you for your detailed feedback and suggestions on the presentation! We will incorporate them into the final version of the paper. We also appreciate that the reviewer finds our paper interesting and practical. Below, we respond to individual questions.

Question 1/Weakness 1: The evaluation is focused on math tasks and does not show the generality of the technique on other domains

SpecReason is designed to generalize to a broad range of reasoning workloads like question answering (e.g., GPQA) and code generation (e.g., HumanEval). We note that our evaluations included GPQA, a general-purpose question answering dataset with questions in biology, physics, and chemistry, and we found that SpecReason's wins generalize to domains beyond math workloads. The three key insights presented in §3 are broadly applicable to any task where reasoning models – those that generate intermediate thinking tokens before summarizing and answering – are primarily used. To further strengthen our evaluation, if time permits, we plan to include representative datasets from additional workload categories (e.g., code generation) in the final version of our paper.

Weakness 2: Empirical analysis on the insights and the confidence scoring patterns

Our first key insight, "intermediate steps are easier than end-to-end reasoning," is the cornerstone assumption of chain-of-thought reasoning [1]. Our third key insight, "LRMs can correct occasional mistakes via self-reflection," is a widely observed phenomenon in LRMs (e.g., DeepSeek-R1’s Aha moment [2] suggests that reflection behavior emerges during training). We will add the corresponding references to related work.

Our second key insight, "reasoning progress depends on insights, not exact tokens," is our novel contribution. It is supported by empirical results showing that relaxing token-level equivalence to semantic equivalence or similarity still preserves quality, as demonstrated in our main results (Fig. 3). We demonstrate SpecReason’s confidence score patterns in Fig. 1, but we will be happy to include additional example reasoning trajectories comparing vanilla inference with SpecReason (alongside the confidence score patterns) in the final version of our paper.

If the reviewer has specific suggestions for experiments to empirically analyze the confidence scoring patterns (e.g., how the average confidence scores across requests correlate with final answer accuracy, or how the acceptance rate correlates with the acceptance threshold), we will be happy to ask for clarifications during the reviewer–author discussion stage and address them in the final version of the paper.

Weakness 3: Empirical analysis on using confidence based on the logits of the speculated step rather than prompting the base model

This is an interesting idea – we speculate that SpecReason’s current approach will outperform this scheme, since our current scoring mechanism relies on a larger, more capable model for assessment rather than on the information (i.e., logits/confidence) within the smaller draft model. If time permits, we will include a microbenchmark of this in the final version of the paper.
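
For reference, a minimal sketch of the alternative the reviewer proposes; the acceptance threshold is an arbitrary assumption:

```python
import math

# Accept a speculated step based on the draft model's own confidence: the
# geometric-mean per-token probability recovered from its token log-probs.
def draft_confidence(token_logprobs: list[float]) -> float:
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def accept_by_confidence(token_logprobs: list[float],
                         threshold: float = 0.85) -> bool:
    return draft_confidence(token_logprobs) >= threshold

# A fairly confident four-token step: mean log-prob -0.0625 -> p ~ 0.94.
print(accept_by_confidence([-0.05, -0.10, -0.02, -0.08]))  # True
```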

Weakness 4: Error bars

We thank the reviewer for pointing this out, and we will add error bars to our main results.

References

[1] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., ... & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, 24824-24837.

[2] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., ... & He, Y. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

Comment

I thank the authors for the rebuttal and for addressing my concerns. Considering the GPQA dataset as a math dataset was a misunderstanding on my part, and indeed the evaluation considers sufficiently diverse benchmarks. I have raised my score to 5 accordingly.

“If time permits, we will include a microbenchmark of this in the final version of the paper.”

There is ample time to perform this experiment until camera-ready if the paper gets accepted. I encourage authors to include this.

Comment

Thank you for the insightful review. Could you please check how well the authors' response addresses your questions/concerns, and share any feedback with the authors? This is valuable for the authors and for the decision-making review process.

Comment

Hi, thank you again for your thoughtful review! We wanted to follow up in case any part of our response, especially regarding GPQA as a non-math dataset, was unclear or could benefit from further discussion and clarification. We’d be grateful for any further feedback you’d like to share on how to improve our paper. Thanks so much!

Final Decision

This paper addresses the challenge of accelerating inference for Large Reasoning Models (LRMs) by leveraging inference-time computation. The proposed approach employs a lightweight model to speculate individual reasoning steps, which are then assessed by a stronger base model that assigns a score between 1 and 9. If the score exceeds a threshold (e.g., ≥7), the lightweight model’s output is accepted; otherwise, the base model regenerates the step. While the mechanism is conceptually simple, the paper builds on key insights into reasoning: (1) reasoning difficulty is heterogeneous—many intermediate steps are relatively simple and can be effectively handled by smaller models; and (2) reasoning quality depends more on semantic soundness than on exact token fidelity, enabling approximate but meaningful step generation. These are important contributions to the emerging and highly active area of LRMs. The main weakness of the proposed approach is the reliance on prompting a stronger model for step scoring, which introduces additional latency, computational cost, and possible robustness issues. Moreover, the paper lacks sufficiently thorough comparisons with concurrent methods, which weakens the evaluation. The authors have responded to the identified weaknesses and promised to update the draft with additional discussions, more comprehensive comparisons, and further clarifications. If these commitments are fulfilled in the final version, the paper will be substantially strengthened and could make a meaningful contribution to advancing efficient inference for LRMs.