PaperHub
Overall: 7.8 / 10 (Spotlight; 4 reviewers, min 4, max 5, std 0.4)
Ratings: 5, 4, 5, 5 · Confidence: 3.5
Novelty: 3.0 · Quality: 3.3 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

Controlling Thinking Speed in Reasoning Models

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
Large Language Models · Interpretability · Reasoning strategy

Reviews and Discussion

Review (Rating: 5)

This paper introduces a dynamic thinking speed adjustment method for Large Reasoning Models. By analyzing two types of model outputs, i.e., fast, intuition-based answers and slower, reasoning-based responses, the authors propose summarizing a "reading vector" from fast and slow response samples to intervene in the model's inference behavior. They first demonstrate that slowing down inference leads to improved reasoning performance. Then, they introduce a heuristic-based dynamic speed control algorithm that enables the model to switch between fast, intuitive and slow, deliberate reasoning states, achieving comparable results with improved computational efficiency.

Strengths and Weaknesses

Strengths:

The paper is easy to follow, and it is valuable that it tackles the challenge of controllability across different thinking modes by directly interpreting and intervening in the model's internal architecture. The problem is well-defined within a clear scope, and the experiments demonstrate the effectiveness of the proposed method. The work also raises interesting questions about whether classical tools such as PCA and information theory could further aid in understanding the "flow" of large language models.

Weaknesses:

The proposed method, in its spirit, appears to be closely related to citation [41]. It may be helpful to better motivate this connection by introducing more of the high-level ideas behind [41], rather than focusing only on methodological overlap. In other words, does the theory in [41] help explain or justify the proposed heuristic intervention approach? If so, drawing that connection more clearly could strengthen the argument of this paper. (Please correct me if I’m mistaken.)

As the authors also acknowledge, the current method is largely heuristic. Some theoretical findings are expected in future research.

Questions

  1. Would it be possible to visualize the α values as a time series over the course of text generation? This could help assess whether the proposed sliding-window algorithm is performing effectively and provide insights into how the control signal influences generation in real time.

  2. What happens if α becomes too large or too small? How can we ensure that such variations do not disrupt the semantic content of the generation? In other words, is the control direction disentangled from the actual content or the solution being generated?

  3. Some prior work introduces special tokens like "think" or "pause" to encourage slower, more deliberate inference. How does your approach compare to these methods, and are there potential complementarities? (E.g., "Think before you speak: Training Language Models With Pause Tokens" @ ICLR 2024)

Limitations

The authors have adequately discussed the limitations in the appendices.

Final Justification

I thank the authors for providing the additional experiments. I have offered some further suggestions and will maintain my score.

Formatting Issues

N/A

Author Response

We are pleased that you find our work valuable, interesting, and our methods effective. We hope the responses below address your concerns.

W1

Q: Does the theory in [41] help explain or justify the proposed heuristic intervention approach?

A: We thank the reviewer for the suggestion to clarify the connection between our work and citation [41] (Representation Engineering, RepE). We agree that RepE inspires and provides a theoretical foundation for our method, but we also see our work as a valuable extension of the RepE framework.

First, our core motivation for manipulating LLMs' thinking speed stems from the hypothesis that reasoning styles (fast/slow thinking) belong to high-level, abstract cognitive functions within LLMs. This hypothesis aligns with RepE's top-down, representation-centered view for identifying such functions, in contrast to bottom-up circuit-level analysis. Moreover, our observations of keyword-triggered behaviors further support that different thinking modes are organized along directional subspaces in the model's representation space. This insight justifies our use of PCA to identify steering directions.

While our method builds upon RepE's core principles, we significantly extend its framework both theoretically and functionally. Our early experiments revealed that static, global interventions could only govern model responses toward either fast (efficient) or slow (accurate) thinking modes. This limitation constrains our goal of enhancing both accuracy and efficiency simultaneously.

We attribute these constraints to a fundamental limitation in existing RepE-based approaches: their interventions operate along a single functional axis, applying uniform behavior controls across entire generations. To address this, one of our key contributions is the introduction of adaptive control, which reframes the traditional intervention paradigm into a two-dimensional control task by incorporating the temporal dimension. Specifically, we make the intervention reasoning-aware and context-specific by dynamically adjusting the strength of representation control based on the model's evolving internal reasoning state during generation.

This temporal adaptivity fundamentally distinguishes our work by focusing on when and how models should think (dynamic strategy control) rather than what they should think (static behavior control), thereby bridging RepE into the domain of cognitive process control.
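For concreteness, a minimal sketch of this kind of per-layer representation intervention (the hook mechanics, model layout, and all identifiers here are illustrative assumptions rather than the paper's released code):

```python
import torch

def make_steering_hook(v, alpha):
    """Add alpha * v to a decoder layer's hidden states at every decoding step."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Illustrative usage: steer selected layers with intensity alpha; in the
# adaptive variant, alpha would be updated per decoding step.
# handles = [model.model.layers[l].register_forward_hook(
#                make_steering_hook(v_layers[l], alpha=-4.0))
#            for l in steer_layers]
```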

W2

Q: The current method is largely heuristic. Some theoretical findings are expected in future research.

A: Though our current methods, such as sliding-window-based adaptive control, remain heuristic, we believe that the core motivation for introducing temporal adaptivity, along with the success of our experimental results and analyses, provides valuable insights into the interpretability of LLMs' reasoning flow and may inspire future work to revisit the representation-control framework.

Q1

Q: Would it be possible to visualize the α values as a time series over the course of text generation?

A: We provide the following statistical analysis to elucidate our adaptive control method's dynamics. Since the main function of our adaptive control method is to force LRMs to switch to slow-thinking mode (i.e., set α to negative values) when encountering difficult reasoning, we specifically study how the frequency of these switches evolves over time. We first divide the model's responses into three temporal segments: start (first 25%), middle (25%-75%), and end (last 25%). We measure switching frequency by calculating the average token interval between these switches (where shorter intervals indicate more frequent switching); a sketch of this computation follows the findings below. Here we present results from DeepSeek-R1-Distill-Qwen-7B.

Avg. token interval between switches (shorter = more frequent switching):

| Dataset | Start | Mid | End |
|-|-|-|-|
| AIME24 | 17.8 | 20.9 | 54.9 |
| MATH500 | 14.5 | 22.5 | 41.7 |

We found that:
  1. The model tends to switch to slow-thinking mode most frequently at the beginning. This is reasonable, as LRMs typically exhibit complex and diverse reasoning behaviors at this stage, such as problem analysis, knowledge point recall, and solution plan development.
  2. The middle segment shows the second-highest switching frequency, as the model begins problem-solving. This phase often involves intensive reasoning and reflection on trial-and-error attempts.
  3. The switching frequency drastically decreases in the final segment. This is expected, as the model usually converges on its final solution and ceases to generate new thoughts.

These patterns demonstrate that our control algorithm produces reasoning dynamics that closely align with human cognitive processes.
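A minimal sketch of the segment measurement above (the input is assumed to be the token indices at which the controller forced a slow-thinking switch):

```python
import numpy as np

def interval_by_segment(switch_positions, total_len):
    """Average token interval between switches in the start/mid/end segments."""
    bounds = [(0.00, 0.25), (0.25, 0.75), (0.75, 1.00)]
    result = []
    for lo, hi in bounds:
        seg = [p for p in switch_positions if lo * total_len <= p < hi * total_len]
        # mean gap between consecutive switches; undefined with < 2 switches
        result.append(float(np.mean(np.diff(seg))) if len(seg) > 1 else float("nan"))
    return result  # [start, mid, end]
```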

To understand how this control signal influences the generation, we identify the top-5 most frequent tokens immediately after slow-thinking switches:

[" no", " maybe", " let", " perhaps", " but"]

These hesitation markers consistently signal reflections and reconsiderations in models' CoTs. For qualitative analysis, we include detailed case studies in Figures B1-B3, which further validate our method's effectiveness in triggering crucial reasoning behaviors for accurate problem-solving. These findings will be included in the final version of the paper.

Q2

Q: What happens if α becomes too large or too small? How can we ensure that such variations do not disrupt the semantic content of the generation?

A: To study the effect of extreme values of the steering intensity α, we scale the absolute value of α when applying our thinking speed control to DeepSeek-R1-Distill-Qwen-7B on AIME24. The results are shown below:

| α | 64 | 32 | 16 | 8 | 0 | -2 | -4 | -6 | -16 | -32 |
|-|-|-|-|-|-|-|-|-|-|-|
| Acc | 0.0 | 6.2 | 41.7 | 53.3 | 52.5 | 53.7 | 55.4 | 52.9 | 39.6 | 0.8 |
| Length | Repetitive generation | 1941.2 | 6232.5 | 8735.6 | 12451.2 | 14364.6 | 15144.8 | 15843.4 | 20241.7 | Repetitive generation |

The results show that when |α| becomes excessively large (e.g., α = 64 or α = -32), the generation quality degrades significantly, leading to semantic collapse and repetitive outputs.

Regarding the second question, we find that fully disentangling reasoning style from semantic content is challenging. Our experiments in Q1 show that steering with negative α can induce reflective behavior, as evidenced by increased usage of tokens such as "no" and "but". However, we believe the underlying mechanism is more complex than simple token-level shifts, with minimal semantic impact in most cases. This is evidenced by the facts that (1) semantic disruption occurs only with extremely large values of α, and (2) the accuracy-efficiency trade-off remains stable across a wide range of α values within reasonable bounds.

The above observations suggest that for reasoning benchmarks, the critical requirement for effective intervention methods is preserving the semantic correctness of generated content. To further assess our intervention methods, we compare our thinking-speed vector to two baseline vectors:

  • Hesitation vector: To test whether our control vector merely increases the likelihood of hesitation tokens, we construct a "hesitation" vector using a modified stimulus pair. The negative stimulus remains the same (slow CoT), but the positive stimulus is generated by appending the word "Wait" to its paired negative stimulus. The rest of the vector construction process follows our original protocol. Notably, we observe that this vector leads to semantic collapse and repetition at a much lower intensity (α' = 8).
  • Correctness vector: We also explore whether our control direction inadvertently overlaps with the model’s sense of solution correctness. We construct a "correctness" vector using paired correct and incorrect solutions from PRM800K as stimuli.

We measure the angle (derived from cosine similarity) between our control vector and these two baselines across the last 10 layers of LRMs. The results below (in degrees) suggest that our vector exhibits only minor overlap with the hesitation vector (about 10° off orthogonal) and remains largely orthogonal to the correctness vector. This further suggests that our control direction is not trivially token-driven and does not interfere with task correctness.

| Vector \ Layer | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 |
|-|-|-|-|-|-|-|-|-|-|-|
| Hesitation | 81.5 | 82.5 | 82.7 | 79.3 | 79.4 | 78.2 | 79.7 | 80.9 | 80.8 | 85.0 |
| Correctness | 94.3 | 94.4 | 93.0 | 90.8 | 90.5 | 93.3 | 92.7 | 95.5 | 93.3 | 96.7 |
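A minimal sketch of the angle computation above (per-layer vectors are assumed to be numpy arrays):

```python
import numpy as np

def angle_deg(u, v):
    """Angle in degrees between two direction vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# e.g. [angle_deg(speed_vec[l], hesitation_vec[l]) for l in range(19, 29)]
```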

Q3

Q: Some prior work introduces special tokens like "think" or "pause". How does your approach compare to these methods, and are there potential complementarities?

A: We thank the reviewer for pointing out the connection. The mechanisms behind these prior works and ours are quite different.

The pause-token methods introduce explicit delays via special tokens, which requires retraining the model. They extend the thoughts of models by explicitly increasing the number of internal hidden states.

In contrast, our vector-based intervention method introduces no architectural or training modifications. We operate on the LLM's representation space. Our approach is thus lightweight, inference-time only, and generalizable across diverse models without retraining.

Despite these differences, we foresee combining our method with these approaches. Since pause tokens offer additional computational bandwidth, our control signal could be used to govern when and how that bandwidth is used. For models that have undergone pause-token training, we could also study the mechanisms that trigger pause tokens. This would provide more insights into the mechanisms of reasoning flows.

Comment

Thank you for the detailed rebuttal. I found the new material very interesting, and I believe it would strengthen the paper if these additions were incorporated into the final version. Based on our exchange, I would like to offer two additional suggestions:

  1. Regarding W1: I am not questioning the novelty of the proposed approach. On the contrary, I believe a more thorough discussion of citation [41] would help clarify the conceptual foundation of your method. Without this, the proposed vector tuning strategy may come across as somewhat ad-hoc to readers unfamiliar with the background.

  2. Regarding Q2: I understand there is a potential trade-off between semantic correctness and speed controllability, or perhaps the trade-off is not particularly significant. In either case, it would be helpful if the authors could provide a compact, quantitative summary of this relationship to better inform the reader.

Thank you for providing the additional experiments, and I will maintain my score.

Comment

We sincerely thank the reviewer for the kind follow-up and thoughtful suggestions. We're very encouraged that you found the new material interesting and appreciate your additional advice.

  • Regarding W1, we agree that it is important to provide background on the representational theory introduced in [41]. RepE proposes that abstract cognitive functions can be encoded as linear directions in activation space. We hypothesize that fast and slow thinking modes fall within this category, and our findings support this view: different reasoning styles are organized along distinct directional subspaces, and steering vectors corresponding to these directions can be extracted to modulate reasoning behavior. We will add this clarification to the final version of the paper.

  • Regarding Q2, we will include the experimental results from our rebuttal along with a concise summary, highlighting that:

    • Steering remains effective and semantically stable across a wide range of α values, with semantic collapse only observed at extreme values (e.g., α = 64 or α = -32).
    • The steering vector for reasoning speed control remains largely orthogonal to directions associated with correctness judgment and hesitation token likelihood, suggesting a clean separation from semantic content and correctness.

Thank you again for your constructive feedback. We will incorporate these suggestions to further improve the clarity and impact of the final version.

Review (Rating: 4)

This work proposes a method to control the thinking speed of large reasoning models while maintaining or even improving accuracy. It identifies differences in leading words between slow and fast thought processes and introduces a thinking speed control method based on learned representation differences between fast and slow response pairs. The method's effectiveness is demonstrated across various datasets and reasoning models. Additionally, an adaptive speed control method is proposed, leveraging problem difficulty estimation, which shows performance improvements over the non-adaptive approach.

Strengths and Weaknesses

The identification of distinctions in leading words between slow and fast thinking processes is interesting. However, it is unclear whether this distinction applies to other reasoning models (e.g., o1) or non-reasoning models, as it might be a side effect of a specific reinforcement learning protocol. The thinking speed control method is intriguing, particularly the existence of a vector in the representation space that adjusts thinking speed. Yet, the generalizability of this vector across other reasoning and non-reasoning models remains uncertain. The work would benefit from deeper investigation into these core phenomena to better understand the proposed approach's broader applicability.

Questions

Could the authors provide additional results on the generality and transferability of both the leading words and the control vector? Furthermore, how do the results in Figure 5 compare to prompt engineering approaches, such as providing different leading words as illustrated in Figure 2? These points are well addressed by the authors in the responses.

Limitations

yes

Final Justification

I find this to be a technically solid paper. While I am uncertain about its practical applicability, I am inclined towards acceptance.

Formatting Issues

N/A

Author Response

We are glad that you find our work interesting and our results intriguing. We hope our additional experiments and the responses below address your concerns.

W1 & Q1-part1

Q: Could the authors provide additional results on the generality and transferability of both the leading words and the control vector?

A: We appreciate this valuable suggestion. To evaluate the generality of keyword-triggered behaviors and the transferability of our control vector, we conducted experiments across a diverse set of models, including both non-reasoning and reasoning models:

  • For non-reasoning models, we include Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct, which represent different architectures and training protocols.
  • For reasoning models: Since o1 is closed-source with limited API functionality, we instead evaluate the following open-source alternatives: DeepSeek-R1-Distill-Llama-8B and nvidia/Llama-3.1-Nemotron-Nano-8B-v1. Both models are Llama3.1-8B derivatives but employ contrasting post-training approaches (distillation vs. reinforcement learning), enabling direct comparison of RL effects. We additionally evaluate GLM-Z1-9B-0414, which features a distinct architecture from both Qwen and Llama families and is trained with alternative RL protocols, to further assess cross-architectural transferability.

We first evaluate the effectiveness of leading words for triggering slow and fast thinking modes across these models using the MATH500 benchmark.

For non-reasoning models, we treat their original outputs as fast-thinking responses and attempt to induce longer reasoning chains using the most common slow-thinking trigger identified in LRMs (Figure 2), i.e., the word "Okay". The results shown below demonstrate that this trigger fails to consistently extend the models' reasoning processes. This outcome is expected, as these non-reasoning models lack specialized long CoT training and have limited exposure to long reasoning data, making their slow thinking capabilities difficult to activate through simple prompting.

| Model | Slow-thinking (Pass@1 (%) / Output Length) | Fast-thinking (Pass@1 (%) / Output Length) |
|-|-|-|
| Qwen2.5-7B-Instruct | 74.8 / 654.0 | 75.9 / 655.2 |
| Llama-3.1-8B-Instruct | 43.9 / 1188.7 | 41.5 / 702.2 |

For reasoning models, we treat their original MATH500 responses as slow-thinking outputs and elicit fast-thinking responses using the trigger word "To" (consistent with our main paper). The results below demonstrate consistent keyword-triggered behavior across diverse LRMs, showing robustness to variations in model architecture and training protocols.

| Model | Slow-thinking (Pass@1 (%) / Output Length) | Fast-thinking (Pass@1 (%) / Output Length) | Relative Differences |
|-|-|-|-|
| Llama-3.1-Nemotron-Nano-8B-v1 | 93.7 / 3893.4 | 91.6 / 2760.6 | -2.2% / -29.1% |
| DeepSeek-R1-Distill-Llama-8B | 88.2 / 3694.1 | 63.8 / 878.7 | -27.7% / -76.2% |
| GLM-Z1-9B-0414 | 96.0 / 2847.1 | 91.7 / 1691.8 | -4.5% / -40.6% |

Next, we evaluate the transferability of our control vector extraction method and the generalizability of the resulting thinking speed control across all models. After extracting the control vector from each model using the same method as in Section 3, we use it to manipulate the models' reasoning behaviors under different steering intensities on AIME24 and MATH500. The results are shown below. Despite the differences in keyword-triggered behaviors between non-reasoning and reasoning models, the scaling effects introduced by our intervention method generalize across all types of models, demonstrating its transferability. Notably, to extract the models' representations for different thinking modes, we reuse the slow and fast CoTs generated by DeepSeek-R1-Distill-Qwen-7B across all tested models. The effectiveness of this stimulus sharing further suggests that the ability to think in both modes is inherent and shared across models, and is likely unrelated to any specific training protocol or model family.

AIME 24 (Pass@1 (%) / Output Length):

| Models | α=8 | α=4 | α=0 | α=-2 | α=-4 |
|-|-|-|-|-|-|
| *Non-reasoning* | | | | | |
| Llama3.1-8B-Instruct | 8.3 / 783.8 | 5.4 / 1293.2 | 6.8 / 3501.2 | 6.7 / 3787.6 | 7.1 / 4748.7 |
| Qwen2.5-7B-Instruct | 9.2 / 1574.9 | 10.4 / 1841.0 | 9.6 / 1886.7 | 11.7 / 1959.4 | 10.8 / 3177.7 |
| *Reasoning* | | | | | |
| Llama-3.1-Nemotron-Nano-8B-v1 | 47.1 / 8072.1 | 62.1 / 10719.1 | 62.9 / 11044.5 | 65.0 / 11682.1 | 63.3 / 13414.8 |
| DeepSeek-R1-Distill-Llama-8B | 35.0 / 10589.4 | 42.1 / 11164.8 | 50.0 / 13398.3 | 53.3 / 14709.1 | 51.7 / 15776.1 |
| GLM-Z1-9B-0414 | 65.0 / 7870.0 | 67.1 / 8346.8 | 67.5 / 8883.45 | 71.3 / 9585.57 | 72.1 / 10150.1 |

MATH500 (Pass@1 (%) / Output Length):

| Models | α=8 | α=4 | α=0 | α=-2 | α=-4 |
|-|-|-|-|-|-|
| *Non-reasoning* | | | | | |
| Qwen2.5-7B-Instruct | 72.9 / 534.6 | 74.0 / 544.2 | 75.9 / 655.2 | 76.2 / 700.3 | 75.4 / 868.0 |
| Llama3.1-8B-Instruct | 41.6 / 541.8 | 42.9 / 599.4 | 41.5 / 702.2 | 42.8 / 773.9 | 39.5 / 815.2 |
| *Reasoning* | | | | | |
| Llama-3.1-Nemotron-Nano-8B-v1 | 79.4 / 2117.6 | 87.7 / 3091.64 | 93.7 / 3893.4 | 94.3 / 4118.9 | 94.3 / 4303.09 |
| DeepSeek-R1-Distill-Llama-8B | 75.9 / 2650.7 | 85.1 / 2895.3 | 88.2 / 3694.1 | 89.6 / 4241.2 | 90.2 / 4829.1 |
| GLM-Z1-9B-0414 | 94.3 / 2215.0 | 96.2 / 2439.9 | 96.0 / 2847.1 | 96.5 / 3023.3 | 96.4 / 3265.5 |

To summarize, although keyword-triggering behaviors may vary across models, the ability to switch between slow and fast thinking is broadly present across a wide range of LLMs and can be effectively manipulated using our intervention methods. We therefore believe our approach has broad applicability. Moreover, we hope that the insights underlying our methods and experimental findings will offer valuable contributions to future research and deepen our understanding of LLMs' internal reasoning processes.

Q1-part2

Q: How do the results in Figure 5 compare to prompt engineering approaches, such as providing different leading words as illustrated in Figure 2?

A: Thank you for this valuable suggestion. We compared the Figure 5 results against three prompt engineering baselines (a sketch of how these prompts prefill the assistant turn follows the list):

  • Prompt 1 [1]:
<|User|>[instruction]\nAnswer after a short amount of thinking. Do not spend excessive time double-checking your work.<|Assistant|><think>\n
  • Prompt 2 (Fast-thinking):
<|User|>[instruction]<|Assistant|><think>\nTo
  • Prompt 3 [2]:
<|User|>[instruction]<|Assistant|><think>\nOkay I have finished thinking.\n</think>\n
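A minimal sketch of how these baselines prefill the assistant turn (the template strings come from the list above; the helper itself is an illustration):

```python
def build_prompt(instruction, variant):
    """Prefill the assistant turn to steer thinking behavior via prompting."""
    templates = {
        "prompt1": ("<|User|>{q}\nAnswer after a short amount of thinking. "
                    "Do not spend excessive time double-checking your work."
                    "<|Assistant|><think>\n"),
        "prompt2_fast": "<|User|>{q}<|Assistant|><think>\nTo",
        "prompt3_nothink": ("<|User|>{q}<|Assistant|><think>\n"
                            "Okay I have finished thinking.\n</think>\n"),
    }
    return templates[variant].format(q=instruction)
```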

Due to figure restrictions during this rebuttal period, we present the performance under each prompt strategy alongside the two closest intervention result points from Figure 5. We experiment with DeepSeek-R1-Distill-Qwen-32B and report the results as follows:

  • MATH500

| Method | Length (x-axis) | Pass@1 (y-axis) | Token Efficiency (Pass@1 / Length, slope, %) |
|-|-|-|-|
| α=0 | 2947.3 | 94.2 | 3.20 |
| α=4 | 2088.7 | 92.0 | 4.40 |
| Prompt 1 | 2079.7 | 91.3 | 4.39 |
| α=8 | 1495.8 | 87.6 | 5.86 |
| α=12 | 1034.5 | 81.4 | 7.87 |
| Prompt 2 | 967.1 | 79.1 | 8.18 |
| α=16 | 720.5 | 74.3 | 10.31 |
| Prompt 3 | 641.7 | 79.8 | 12.43 |

  • AIME24

| Method | Length (x-axis) | Pass@1 (y-axis) | Token Efficiency (Pass@1 / Length, slope, ‰) |
|-|-|-|-|
| α=0 | 10679.6 | 69.2 | 6.47 |
| Prompt 1 | 9510.7 | 65.4 | 6.87 |
| α=4 | 8000.1 | 60.8 | 7.60 |
| α=12 | 3893.7 | 38.3 | 9.84 |
| Prompt 3 | 3436.4 | 26.2 | 7.62 |
| Prompt 2 | 3076.3 | 27.1 | 8.81 |
| α=16 | 1784.1 | 21.7 | 12.16 |

When plotted on an x-y plane, the results from these prompt engineering methods generally fall below the scaling curve generated by our speed control approach. Moreover, our intervention method offers smoother and more flexible accuracy-efficiency trade-offs, demonstrating clear advantages over traditional prompt engineering techniques.


  1. s1: Simple test-time scaling
  2. Reasoning Models Can Be Effective Without Thinking
Comment

Regarding the transferability experiments, I find a reduced effect in other models. Is it possible to discuss the relation between the optimal steering vectors of different models? Also, are there side effects from tuning α to a VERY large scale in order to compensate for a reduced effect?

Comment

Thank you for your valuable feedback. We hope the following explanations and experiments address your concerns.

Q1

Q: Is it possible to discuss the relation between the optimal steering vectors of different models?

A: Yes, we hypothesize that one way to improve the quality of the steering vectors for different models is to use self-generated stimuli during representation extraction. This approach allows each LLM’s unique internal representations of different thinking modes to be better captured.

In our initial experiments, we demonstrate cross-model transferability of our steering methods by using the fast and slow CoTs generated exclusively by DeepSeek-R1-Distill-Qwen-7B (DS-QW-7B) for representation extraction. This approach yields consistent scaling effects under steering across all tested models. To further enhance steering performance, we examine a model-specific variant by replacing the stimuli for each model with self-generated fast and slow thinking data. The results on MATH500 are shown as follows:

MATH500 (Pass@1 (%) / Output Length):

| Models | Stimuli Source | α=8 | α=4 | α=0 | α=-2 | α=-4 |
|-|-|-|-|-|-|-|
| Llama-3.1-Nemotron-Nano-8B-v1 | DS-QW-7B | 79.4 / 2117.6 | 87.7 / 3091.64 | 93.7 / 3893.4 | 94.3 / 4118.9 | 94.3 / 4303.09 |
| Llama-3.1-Nemotron-Nano-8B-v1 | Self-generated | 80.4 / 1983.6 | 87.6 / 2813.8 | 93.7 / 3893.4 | 94.3 / 4237.0 | 94.6 / 4802.8 |
| GLM-Z1-9B-0414 | DS-QW-7B | 94.3 / 2215.0 | 96.2 / 2439.9 | 96.0 / 2847.1 | 96.5 / 3023.3 | 96.4 / 3265.5 |
| GLM-Z1-9B-0414 | Self-generated | 93.9 / 1981.8 | 94.7 / 2303.5 | 96.0 / 2847.1 | 96.4 / 3305.4 | 96.5 / 3851.9 |

As shown above, using a model's self-generated fast and slow CoTs as stimuli leads to a stronger steering effect on its thinking budget, evidenced by greater variance in response lengths across different α values. We also observe improved steering quality, reflected in the generally higher accuracies under thought-expansion settings (α < 0).

We also acknowledge that additional factors, such as the quality of the original dataset (e.g., its difficulty) used for slow and fast stimulus sampling, as well as the criteria for selecting representative fast and slow samples, may influence the steering vector's quality. Exploring these variables is a promising direction for future work.

However, we view our work as a first step toward revealing the existence of fast and slow thinking modes in LLMs and unlocking new opportunities for reasoning strategies that leverage these modes to control models' reasoning behaviors. Exploring the underlying mechanisms and formal relationships between optimal steering directions across models remains a promising direction for future research.

Q2

Q: Are there side effects from tuning α to a VERY large scale in order to compensate for a reduced effect?

A: Yes, but such side effects only emerge when |α| is pushed to extreme values (e.g., α = 64 or α = -32). As shown in our responses to Reviewers kUJX and duWd, we systematically scale α and observe that the steering effect remains stable and effective across a wide range of values. Only at extreme values of |α| does the model begin to lose semantic coherence, leading to repetitive or degenerate outputs. This behavior is expected, as α serves as a hyperparameter and, like many others in machine learning, exhibits an effective operating range beyond which model behavior can become unstable.

| α | 64 | 32 | 16 | 8 | 0 | -2 | -4 | -6 | -16 | -32 |
|-|-|-|-|-|-|-|-|-|-|-|
| Acc | 0.0 | 6.2 | 41.7 | 53.3 | 52.5 | 53.7 | 55.4 | 52.9 | 39.6 | 0.8 |
| Length | Repetitive generation | 1941.2 | 6232.5 | 8735.6 | 12451.2 | 14364.6 | 15144.8 | 15843.4 | 20241.7 | Repetitive generation |

Comment

Dear Reviewer eb6B,

We hope this message finds you well. As the discussion period is nearing its end with less than 48 hours remaining, we wanted to check in to ensure that we have adequately addressed your concerns. Your constructive feedback and suggestions are very important to us, and we would greatly appreciate any further thoughts you might have.

Thank you again for your time and effort in reviewing our paper and rebuttal.

Comment

My concerns are adequately addressed. I am willing to increase my score to 4.

Comment

Thank you for taking the time to review our rebuttal and follow-up comment! We sincerely appreciate your thoughtful feedback throughout the process, and we will continue to strengthen our work based on your valuable suggestions as well as those from the other reviewers.

Review (Rating: 5)

This paper characterizes the distinction between slow / fast thinking in LLMs, extracts the steering vector capable of controlling thinking speed, and proposes a dynamic decoding method involving difficulty estimation to improve reasoning accuracy while reducing token usage.

Strengths and Weaknesses

Strengths: The paper is very well-written, with clear structures, great presentation and thorough justifications. Each step is backed by extensive experimentation. Good job!

Weaknesses:

  1. My main reservation is the significance of this work. While the claims are all clear and significant, it seems to me the methods employed are natural and a little lacking in novelty.
  2. The method is not exactly one-shot: it requires forward passes over a dataset of slow vs. fast thinking CoTs to extract the steering vector specific to the model. Even though this does not require additional training, I'm not sure if it qualifies as plug-and-play.
  3. Some minor issues, see Questions.

Questions

The main question is to address weakness 1:

  1. Can you outline the main challenges you faced during this work and what innovations, if any, are employed to overcome these challenges?

However, I will most likely not lower my score as long as the following questions are addressed.

  1. Can you discuss the claim "plug-and-play" and the time cost of obtaining the reasoning strategy for a pre-trained model?
  2. In Step 3 of 3.1 (Line 142), the PCA step of separating hidden state pairs into two halves seems unnecessary. At first glance, the exact same result can be achieved by calculating the first principal component of all the {d_i^(-→+)}'s, at least mathematically. Can you discuss this, and either justify the split or revise this paragraph?
  3. Can you further discuss the use of a sliding window for difficulty evaluation?
  4. It would be great if you could discuss the effects of extreme values of the steering intensity α.

Limitations

yes

Final Justification

The authors defended their paper on my major concerns, so I maintain my view that this is a solid paper worthy of acceptance.

Formatting Issues

N/A

Author Response

We are glad that you find our work solid and well-written. We hope our responses below will address your concerns.

W1 & Q1

Q: My main reservation is the significance of this work. Can you outline the main challenges you faced during this work and what innovations are employed to overcome these challenges?

A: Thank you for your thoughtful comment. We believe the main challenge—and the key innovation of our method—lies in transforming the top-down, global, and static intervention approaches used in prior works [1,2,3] into a bottom-up, fine-grained control mechanism that is both query-adaptive and reasoning-aware.

In our early experiments, we found that traditional fixed intervention methods fall short in capturing the dynamic nature of human-like reasoning (i.e., the fluid transitions between System 1 and System 2 thinking). This stems from a key limitation: most interventions operate along a single functional axis, globally inducing or suppressing certain behaviors across the entire generation. They largely ignore the temporal dimension, i.e., the fact that a model's internal reasoning states evolve over time. To address this, we introduce a second axis of intervention: time. This reframes representation intervention as a dynamic, context-aware process.

For interpretability, our difficulty score metric provides a novel framework for analyzing reasoning dynamics. As a reasoning strategy, we extend representation engineering to online, autoregressive generation with temporal adaptivity, which opens new possibilities for context-specific alignment and efficiency optimization.


  1. Representation Engineering: A Top-Down Approach to AI Transparency
  2. ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
  3. Unlocking General Long Chain-of-Thought Reasoning Capabilities of Large Language Models via Representation Engineering

W2 & Q2

Q: The method is not exactly one-shot. Can you discuss the claim "plug-and-play" and the time cost of obtaining the reasoning strategy for a pre-trained model?

A: We use "plug-and-play" to highlight our method's seamless integration into LLM frameworks, offering 3 advantages:

  1. No model structure changes (simple plug-in)
  2. No model constraints (broad generalizability)
  3. No parameter adjustments (zero training cost)

Point 1 is self-explanatory. For Point 2, we tested across diverse architectures (reasoning/non-reasoning models with varying training approaches). To compute the control vector, we reused slow and fast CoTs generated by DeepSeek-R1-Distill-Qwen-7B for representation extraction on all tested models. The results below show consistent scaling effects, confirming the compatibility of our method.

AIME 24 (Pass@1 (%) / Output Length):

| Models | α=8 | α=4 | α=0 | α=-2 | α=-4 |
|-|-|-|-|-|-|
| *Non-reasoning* | | | | | |
| Llama3.1-8B-Instruct | 8.3 / 783.8 | 5.4 / 1293.2 | 6.8 / 3501.2 | 6.7 / 3787.6 | 7.1 / 4748.7 |
| Qwen2.5-7B-Instruct | 9.2 / 1574.9 | 10.4 / 1841.0 | 9.6 / 1886.7 | 11.7 / 1959.4 | 10.8 / 3177.7 |
| *Reasoning* | | | | | |
| Llama-3.1-Nemotron-Nano-8B-v1 | 47.1 / 8072.1 | 62.1 / 10719.1 | 62.9 / 11044.5 | 65.0 / 11682.1 | 63.3 / 13414.8 |
| DeepSeek-R1-Distill-Llama-8B | 35.0 / 10589.4 | 42.1 / 11164.8 | 50.0 / 13398.3 | 53.3 / 14709.1 | 51.7 / 15776.1 |
| GLM-Z1-9B-0414 | 65.0 / 7870.0 | 67.1 / 8346.8 | 67.5 / 8883.45 | 71.3 / 9585.57 | 72.1 / 10150.1 |

MATH500 (Pass@1 (%) / Output Length):

| Models | α=8 | α=4 | α=0 | α=-2 | α=-4 |
|-|-|-|-|-|-|
| *Non-reasoning* | | | | | |
| Qwen2.5-7B-Instruct | 72.9 / 534.6 | 74.0 / 544.2 | 75.9 / 655.2 | 76.2 / 700.3 | 75.4 / 868.0 |
| Llama3.1-8B-Instruct | 41.6 / 541.8 | 42.9 / 599.4 | 41.5 / 702.2 | 42.8 / 773.9 | 39.5 / 815.2 |
| *Reasoning* | | | | | |
| Llama-3.1-Nemotron-Nano-8B-v1 | 79.4 / 2117.6 | 87.7 / 3091.64 | 93.7 / 3893.4 | 94.3 / 4118.9 | 94.3 / 4303.09 |
| DeepSeek-R1-Distill-Llama-8B | 75.9 / 2650.7 | 85.1 / 2895.3 | 88.2 / 3694.1 | 89.6 / 4241.2 | 90.2 / 4829.1 |
| GLM-Z1-9B-0414 | 94.3 / 2215.0 | 96.2 / 2439.9 | 96.0 / 2847.1 | 96.5 / 3023.3 | 96.4 / 3265.5 |

Regarding Point 3 and time costs, our extraction process consists of three main steps:

  1. Slow and fast CoTs generation (2 hours);
  2. Hidden states computation (10 minutes);
  3. PCA computation (few seconds).

Timings were measured on an NVIDIA L20 46GB×8 server using DeepSeek-R1-Distill-Qwen-7B and the MATH training set (7.5k questions) as stimuli. With reusable stimulus pairs, adaptation to new models takes minutes, far faster than training-based methods [1,2].

We'll open-source pre-computed control vectors for popular models and slow and fast thinking CoTs for representation extraction to enhance plug-and-play usability. We will reconsider the term "plug-and-play" in the final version of our paper.


  1. CoT-Valve: Length-Compressible Chain-of-Thought Tuning
  2. O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning

Q3

Q: In Step 3 of 3.1 (Line 142), the PCA step of separating hidden state pairs into two halves seems unnecessary. The exact same result can be achieved by calculating the first principal component of all the d_i^(-→+)'s.

A: PCA computes principal eigenvectors from the covariance matrix, which captures data dispersion around the mean. Splitting the vector set {d_i^(-→+)} and reversing half shifts the centroid and changes the dataset's geometry, likely resulting in a different covariance matrix and principal components compared to the original {d_i^(-→+)}.

As a simple example, consider four data points {1, 1, -1, -1} on a line, whose principal direction lies along the x-axis. Flipping the signs of the last two yields {1, 1, 1, 1}, collapsing the variance to zero and eliminating any principal direction.

Returning to our experiment in Section 3, our goal is to extract a direction that captures the transition from slow to fast thinking. While this direction may already align with a principal component of {d_i^(-→+)}, splitting the dataset and reversing half of it amplifies the variance along the slow→fast axis. This introduces a separating hyperplane, making the target direction easier to isolate and more robust.
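A minimal sketch of this split-and-reverse construction (assuming `diffs` is an (N, hidden_dim) array of d_i^(-→+) difference vectors; the random split and helper names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def extract_direction(diffs, seed=0):
    """First principal component after reversing a random half of the pairs."""
    rng = np.random.default_rng(seed)
    flip = rng.random(len(diffs)) < 0.5
    signed = np.where(flip[:, None], -diffs, diffs)  # reverse half the vectors
    pc1 = PCA(n_components=1).fit(signed).components_[0]
    # re-orient so that +pc1 points from slow to fast thinking
    if np.mean(diffs @ pc1) < 0:
        pc1 = -pc1
    return pc1
```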

To test this, we conduct a controlled experiment using the principal component of {d_i^(-→+)} as a baseline vector. We compare its performance with our method on DeepSeek-R1-Distill-Qwen-7B using AIME24. Results are shown below, where α is the intervention intensity (as in Figure 5), and each cell reports Pass@1 (%) ↑ / Output Length ↓.

| Method | α=16 | α=8 | α=4 | α=0 | α=-4 |
|-|-|-|-|-|-|
| Baseline | 46.7 / 10447.0 | 52.5 / 11493.8 | 54.6 / 12367.3 | 52.5 / 12451.2 | 52.5 / 14747.4 |
| Our Paper | 41.7 / 6232.5 | 53.3 / 8735.6 | 50.4 / 10942.7 | 52.5 / 12451.2 | 55.4 / 15144.8 |

As shown above, our control vector consistently achieves more effective length control and boosts accuracy when thoughts are extended, validating the effectiveness of our extraction protocol.

Q4

Q: Can you further discuss the use of a sliding-window for difficulty evaluation?

A: The sliding-window mechanism detects shifts from straightforward to complex reasoning via sudden increases in the difficulty score (Eq. 4). For such outlier detection, we use a dynamic threshold, defined as the mean plus standard deviation within the window; a sketch of this logic follows the two points below. This method offers two key advantages:

  • Generalizability: Different models and tasks produce varying difficulty score scales. The adaptive threshold removes the need for manual tuning and works robustly across benchmarks.

  • Accuracy–Efficiency Trade-off: As shown in Figure 6 and Table 1, hard reasoning often begins with a spike in difficulty followed by sustained high scores. Our method captures this onset, which guarantees necessary pauses to encourage reflection, and avoids over-intervention afterward. Table 2 and Figures B1–B3 validate this balance.
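A minimal sketch of this detection logic (the window size is an assumed hyperparameter):

```python
from collections import deque
import numpy as np

class SpikeDetector:
    """Flag a difficulty spike when the score exceeds the window mean + std."""
    def __init__(self, window=64):
        self.scores = deque(maxlen=window)

    def update(self, score):
        spike = (len(self.scores) > 1 and
                 score > np.mean(self.scores) + np.std(self.scores))
        self.scores.append(score)
        return spike
```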

We provide comprehensive ablation studies in Appendix B to validate the effectiveness of our method. Additionally, we compare our sliding-window approach with a fixed-threshold baseline, where the threshold is set to the average difficulty score computed on each test benchmark. The results (shown below) show that our sliding-window method consistently yields a better accuracy–efficiency trade-off.

| Methods | AIME24 (Pass@1 (%) / Output Length) | MATH500 (Pass@1 (%) / Output Length) |
|-|-|-|
| Original Inference | 52.5 / 12451.2 | 92.9 / 3403.9 |
| Sliding-window control | 53.8 / 10850.9 | 93.7 / 3122.8 |
| Fixed-threshold control | 52.1 / 13207.4 | 94.1 / 3811.72 |

Q5

Q: It would be great if you could discuss the effects of extreme values of the steering intensity α.

A: To show the effects of extreme values of the steering intensity α, we extend our experiments in Figure 5 by scaling up the absolute value of α when controlling the thinking speed of DeepSeek-R1-Distill-Qwen-7B on AIME24. The results are shown below:

| α | 64 | 32 | 16 | 8 | 0 | -2 | -4 | -6 | -16 | -32 |
|-|-|-|-|-|-|-|-|-|-|-|
| Acc | 0.0 | 6.2 | 41.7 | 53.3 | 52.5 | 53.7 | 55.4 | 52.9 | 39.6 | 0.8 |
| Length | Repetitive generation | 1941.2 | 6232.5 | 8735.6 | 12451.2 | 14364.6 | 15144.8 | 15843.4 | 20241.7 | Repetitive generation |

As shown in the table, increasing α > 0 leads to progressively shorter responses, and when α becomes too large, we observe repetitive generation. Similarly, for α < 0, increasing |α| produces longer responses, eventually causing repetition.

Comment

I thank the authors for their detailed and convincing rebuttals, addressing all of my questions and concerns thoroughly. I have no more questions and would keep my score.

To those concerned, I would like to note that my initial comment on PCA was wrong and unwarranted, and I apologize to the authors for the mistake on my part.

Comment

We sincerely thank the reviewer for the thoughtful engagement and kind follow-up. We greatly appreciate your acknowledgment and are glad that our responses were helpful in addressing your concerns. Your comments and suggestions have contributed meaningfully to improving the clarity and rigor of our work.

Review (Rating: 5)

This paper proposes using steering vector to control the reasoning speed of language models. It constructs "slow" and "fast" reasoning responses by prompting models with specific keywords and then creates representations for reasoning mode switching. Additionally, they introduce a mechanism to dynamically adjust the model's reasoning behavior by estimating question difficulty through token-level logit variation. Experimental results demonstrate that adaptive control via the steering vector can improve performance while reducing the number of tokens used.

Strengths and Weaknesses

Strengths:

  1. The paper proposes to use the steering vector to change reasoning behaviors, which effectively controls the fast/slow reasoning behaviors of language models.
  2. The research questions addressed are timely and important. Determining when to switch between reasoning modes remains an interesting direction in the community.
  3. The writing is well-organized and clear, making the manuscript easy to follow.

Weaknesses:

  1. The performance gains appear modest. Across different benchmarks and models, the gains are relatively limited, raising concerns about the practical impact of the proposed method.
  2. The method for constructing fast reasoning data is somewhat heuristic. Since keyword-triggered behavior can vary across models, this approach may introduce inconsistencies and limit generalizability.

Questions

  1. How frequently do models switch between reasoning modes? It would be helpful to include some analysis showing how often the model transitions between fast and slow modes during inference.
  2. Is the steering vector always applied? Are there cases where the steering coefficient α is zero, preserving the model’s original reasoning behavior?
  3. For the Qwen-3 model, how is the fast reasoning data constructed? Did you try disabling the think mode when building the fast reasoning data?

Limitations

Yes.

Final Justification

I have reviewed the authors' response and found that the additional analysis addresses my concerns. Overall, I find this work to be an interesting application of steering vectors for controlling model reasoning behavior. I have increased my score.

Formatting Issues

N/A

Author Response

We are glad that you find our work timely and important. We hope our responses below address your concerns.

W1

Q: The performance gains appear modest, raising concerns about the practical impact of the proposed method.

A: We acknowledge that, as an inference-time method, our performance gains may appear modest compared to optimization-based RL approaches [1,2]. However, we believe the insights offered by our work provide meaningful contributions to understanding and improving LLM reasoning. Specifically:

  • We uncover latent fast-thinking capabilities in LRMs via prompt-based induction and representation-level control.
  • From an interpretability perspective, our difficulty score and case studies offer a novel lens for tracking and analyzing reasoning processes.
  • We highlight a key limitation of existing representation editing techniques[3,4,5], which treat representation control as a static, single-axis intervention problem—typically controlling high-level behaviors (e.g., honesty, helpfulness) globally throughout generations. In contrast, we introduce temporal adaptivity as a second control axis, reframing representation intervention as a dynamic, context-aware process—a shift that opens new directions for future research.

Our method is highly extensible and can be effectively integrated with existing reasoning strategies to achieve non-trivial improvements. For example, combining our speed control (Section 3) with parallel search on DeepSeek-R1-Distill-Qwen-32B (using α = 8) leads to significant gains over regular generation, as shown below.

  • AIME24

| Methods | Pass@8 | Pass@16 | Pass@32 | Pass@64 |
|-|-|-|-|-|
| Original (α=0) | 62.1 | 70.3 | 76.9 | 83.3 |
| α=8 | 69.5 | 75.2 | 79.1 | 83.3 |

  • GPQA Diamond

| Methods | Pass@8 | Pass@16 | Pass@32 | Pass@64 |
|-|-|-|-|-|
| Original (α=0) | 78.3 | 82.3 | 85.3 | 88.4 |
| α=8 | 82.4 | 87.6 | 91.5 | 94.4 |
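For reference, Pass@k values like those above can be computed with the standard unbiased estimator: given n samples per question with c correct, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k draws is correct."""
    if n - c < k:
        return 1.0  # every size-k subset of samples contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)
```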


  1. O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
  2. Efficient RL Training for Reasoning Models via Length-Aware Optimization
  3. Representation Engineering: A Top-Down Approach to AI Transparency
  4. ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
  5. Unlocking General Long Chain-of-Thought Reasoning Capabilities of Large Language Models via Representation Engineering

W2

Q: Since keyword-triggered behavior can vary across models, this approach may introduce inconsistencies and limit generalizability.

A: We first investigate the generalizability of keyword-triggered behavior across a diverse set of models using MATH500, beginning with non-reasoning models (Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct). Treating their original outputs as fast-thinking, we prompt extended reasoning using the trigger word "Okay" (inspired by Figure 2). This simple cue fails to elicit longer reasoning, likely due to the lack of specialized CoT training.

| Model | Slow-thinking (Pass@1 (%) / Output Length) | Fast-thinking (Pass@1 (%) / Output Length) |
|-|-|-|
| Qwen2.5-7B-Instruct | 74.8 / 654.0 | 75.9 / 655.2 |
| Llama-3.1-8B-Instruct | 43.9 / 1188.7 | 41.5 / 702.2 |

For reasoning models, we include Llama-3.1-Nemotron-Nano-8B-v1, DeepSeek-R1-Distill-Llama-8B, and GLM-Z1-9B-0414. We treat their original responses as slow-thinking outputs and use "To" (consistent with our main paper) to trigger their fast thinking. The results shown below validate the generalizability of keyword-triggered behavior across reasoning models.

| Model | Slow-thinking (Pass@1 (%) / Output Length) | Fast-thinking (Pass@1 (%) / Output Length) | Relative Differences |
|-|-|-|-|
| Llama-3.1-Nemotron-Nano-8B-v1 | 93.7 / 3893.4 | 91.6 / 2760.6 | -2.2% / -29.1% |
| DeepSeek-R1-Distill-Llama-8B | 88.2 / 3694.1 | 63.8 / 878.7 | -27.7% / -76.2% |
| GLM-Z1-9B-0414 | 96.0 / 2847.1 | 91.7 / 1691.8 | -4.5% / -40.6% |

We show that our method is highly generalizable across models. Using only fast and slow-thinking data from DeepSeek-R1-Distill-Qwen-7B as stimuli for all models, we extract control vectors from each model and evaluate on AIME24 and MATH500. The consistent scaling effects suggest that fast and slow thinking are intrinsic, shared capabilities of LLMs. This transferability (1) removes the need for model-specific stimuli generation and (2) avoids inconsistencies stemming from keyword-triggered behaviors.

AIME 24 (Pass@1 (%) / Output Length):

| Models | α=8 | α=4 | α=0 | α=-2 | α=-4 |
|-|-|-|-|-|-|
| *Non-reasoning* | | | | | |
| Llama3.1-8B-Instruct | 8.3 / 783.8 | 5.4 / 1293.2 | 6.8 / 3501.2 | 6.7 / 3787.6 | 7.1 / 4748.7 |
| Qwen2.5-7B-Instruct | 9.2 / 1574.9 | 10.4 / 1841.0 | 9.6 / 1886.7 | 11.7 / 1959.4 | 10.8 / 3177.7 |
| *Reasoning* | | | | | |
| Llama-3.1-Nemotron-Nano-8B-v1 | 47.1 / 8072.1 | 62.1 / 10719.1 | 62.9 / 11044.5 | 65.0 / 11682.1 | 63.3 / 13414.8 |
| DeepSeek-R1-Distill-Llama-8B | 35.0 / 10589.4 | 42.1 / 11164.8 | 50.0 / 13398.3 | 53.3 / 14709.1 | 51.7 / 15776.1 |
| GLM-Z1-9B-0414 | 65.0 / 7870.0 | 67.1 / 8346.8 | 67.5 / 8883.45 | 71.3 / 9585.57 | 72.1 / 10150.1 |

MATH500 (Pass@1 (%) / Output Length):

| Models | α=8 | α=4 | α=0 | α=-2 | α=-4 |
|-|-|-|-|-|-|
| *Non-reasoning* | | | | | |
| Qwen2.5-7B-Instruct | 72.9 / 534.6 | 74.0 / 544.2 | 75.9 / 655.2 | 76.2 / 700.3 | 75.4 / 868.0 |
| Llama3.1-8B-Instruct | 41.6 / 541.8 | 42.9 / 599.4 | 41.5 / 702.2 | 42.8 / 773.9 | 39.5 / 815.2 |
| *Reasoning* | | | | | |
| Llama-3.1-Nemotron-Nano-8B-v1 | 79.4 / 2117.6 | 87.7 / 3091.64 | 93.7 / 3893.4 | 94.3 / 4118.9 | 94.3 / 4303.09 |
| DeepSeek-R1-Distill-Llama-8B | 75.9 / 2650.7 | 85.1 / 2895.3 | 88.2 / 3694.1 | 89.6 / 4241.2 | 90.2 / 4829.1 |
| GLM-Z1-9B-0414 | 94.3 / 2215.0 | 96.2 / 2439.9 | 96.0 / 2847.1 | 96.5 / 3023.3 | 96.4 / 3265.5 |

Q1

Q: How frequently do models switch between reasoning modes? It would be helpful to include some analysis.

A: Thank you for your insightful suggestion. We analyze reasoning mode switch frequency for DeepSeek-R1-Distill-Qwen-7B on math benchmarks, finding averages of 578.6 (AIME24) and 167.2 (MATH500) switches per question. The higher frequency for AIME24 reflects its greater complexity, requiring more slow-thinking engagement.

Next, to better understand the model's reasoning dynamics, we examine how switching frequency varies throughout the thought process. We divide the responses into three temporal segments: start (first 25%), middle (25%-75%), and end (last 25%). We then calculate the average token intervals between these switches (with lower values indicating more frequent switches). The results are presented below:

Avg. token interval between switches (shorter = more frequent switching):

| Dataset | Start | Mid | End |
|-|-|-|-|
| AIME24 | 17.8 | 20.9 | 54.9 |
| MATH500 | 14.5 | 22.5 | 41.7 |

We found that (1) the model most frequently enters slow-thinking mode at the beginning, likely due to initial problem analysis requiring deeper processing; (2) the middle segment also shows high switching frequency, reflecting active reasoning during problem-solving; and (3) switching drops in the final segment, as the model converges on a solution and generates fewer new thoughts.

Finally, we analyze the relationship between internal thinking modes and model outputs. First, we identify the top-100 tokens that most frequently trigger slow-thinking transitions:

| Type | Most Frequent Words |
|-|-|
| Calculation | sqrt, denominator, ≈, triangle, product, ... |
| Analysis | Problem, would, seems, because, find, ... |
| Reflection | Wait, Alternatively, no, maybe, ... |

Our results demonstrate that models tend to switch to slow-thinking for (1) mathematical computations, (2) logical deductions, and (3) triggering certain reflection behaviors.

To assess the alignment between internal mode switches and external outputs, we measured how often the token "Wait", commonly used as a marker of uncertainty [1,2], coincides with actual transitions to slow-thinking mode. Surprisingly, on AIME24, the model outputs "Wait" 55.3 times per question on average, but only 1 in 12.2 instances aligns with a true mode switch. On MATH500, the ratio is 1 in 15.0.

These results suggest that (1) the overuse of "Wait" reflects the overthinking behavior[1,2], and (2) analyses relying on output tokens to infer reasoning states [1,3] may be misleading, given the weak correlation with internal cognitive transitions.
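A minimal sketch of the alignment measurement above (the token-position inputs and the matching tolerance are assumptions for illustration):

```python
def wait_switch_alignment(wait_positions, switch_positions, tol=2):
    """Fraction of "Wait" emissions that coincide with a slow-thinking switch."""
    switches = sorted(switch_positions)
    aligned = sum(any(abs(w - s) <= tol for s in switches)
                  for w in wait_positions)
    return aligned / max(len(wait_positions), 1)
```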

We hope these findings will provide additional insights for interpreting LLMs' reasoning behaviors and inspire future research in this area.


  1. Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
  2. Efficient Reasoning Through Suppression of Self-Affirmation Reflections in Large Reasoning Models
  3. Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs

Q2

Q: Is the steering vector always applied? Are there cases where the steering coefficient is zero, preserving the model’s original reasoning behavior?

A: The steering coefficient can be zero. Our sliding-window control follows this logic:

if current_difficulty_score > current_detection_threshold:
    # difficulty spike detected: force the slow-thinking (negative) intensity
    current_steering_intensity = minimum_steering_intensity
else:
    # otherwise drift back toward fast thinking, one step per token
    current_steering_intensity += acceleration_step_size

Since the minimum steering intensity is negative, the intensity will gradually increase (passing through zero) when no high-difficulty signals are detected.

Q3

Q: For the Qwen-3 model, how is the fast reasoning data constructed?

A: We followed the same protocol described in Section 2 to generate fast-thinking responses (i.e., prompting the model to begin its reasoning with "To"), consistent with all other models in our main experiments.

Comment

I have reviewed the authors' response and found that the additional analysis addresses my concerns. I encourage the authors to incorporate these explanations into the final revision. Overall, I find this work to be a good application of steering vectors for controlling model reasoning behavior. I have raised my score accordingly.

Comment

We sincerely appreciate the reviewer's positive evaluation and constructive feedback. We are also delighted to know that our additional analyses have adequately addressed your concerns. These explanations, along with the accompanying experimental results, will be included in our final version of the work.

Final Decision

This paper proposes to control the reasoning speed of language models using steering vectors derived from prompting. It also introduces a method to automatically adjust the model's reasoning speed by question difficulty estimate via token-level logit variation.

The idea is novel and interesting, and the experiments are comprehensive. The reviewers raised some questions about the generality and theoretical justification of the proposed methods. The authors did a good job of addressing those concerns.