PaperHub
6.4/10 · Poster · 4 reviewers
Ratings: 4, 2, 5, 5 (min 2, max 5, std 1.2, mean 4.0)
Confidence · Novelty 3.3 · Quality 3.0 · Clarity 3.0 · Significance 3.3
NeurIPS 2025

Mitigating Overthinking in Large Reasoning Models via Manifold Steering

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

We propose manifold steering, which projects the overthinking steering direction onto the low-dimensional activation manifold, effectively reducing output tokens while maintaining accuracy.

Abstract

Keywords
Large Reasoning Models · Overthinking · Mechanistic Interpretability · Manifold Steering

Reviews & Discussion

Review (Rating: 4)

This paper aims to mitigate overthinking in large reasoning models (LRMs) by adjusting their internal activations. Based on the assumption that activations corresponding to overthinking reasoning traces can be distinguished from concise ones, the paper introduces a correction term derived from the difference-in-means between overthinking (redundant) and concise reasoning traces (in math). However, the paper finds that such a correction doesn't always guarantee redundancy reduction. As a remedy, the paper introduces manifold steering, where the correction term is projected onto a low-dimensional space (manifold). Experimental results on math (in-domain), code generation, and disciplinary knowledge (out-of-domain) demonstrate a reduction of reasoning trace lengths while achieving the same or better accuracy.
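To make the two steps of this pipeline concrete, here is a minimal NumPy sketch using random placeholder activations and illustrative dimensions; it is a reconstruction for intuition, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 64, 500, 10  # hidden size, traces per set, manifold dimension

# Placeholder activations at a chosen layer; in practice these come from
# redundant (overthinking) and concise reasoning traces.
acts_redundant = rng.standard_normal((n, d))
acts_concise = rng.standard_normal((n, d))

# Difference-in-means steering (correction) direction r.
r = acts_redundant.mean(axis=0) - acts_concise.mean(axis=0)

# PCA over the joint activations to estimate the low-dimensional manifold.
A = np.vstack([acts_redundant, acts_concise])
A_centered = A - A.mean(axis=0)
cov = A_centered.T @ A_centered / (A.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
U_k = eigvecs[:, -k:]                    # top-k principal directions

# Manifold steering: keep only the component of r inside the subspace.
r_manifold = U_k @ (U_k.T @ r)
```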

Strengths and Weaknesses

Strengths: The paper addresses an important issue in LRMs. Indeed, empirical results show that the proposed method can mitigate overthinking in R1-based models.

Weaknesses: I see the main weakness in the major flaws of the formal derivations of the paper, which might impact its actual results. In that regard, the paper needs major rewriting.

  • Paragraph “Linear Low-Dimensional Manifold Verification” states that the steering direction is composed of an overthinking ($r_{\text{overthinking}}$) and an error ($r_{\text{other}}$) term by performing a PCA on joint redundant and concise reasoning traces. However, I don’t see how a joint PCA shows orthogonality between the two. The PCA just shows the concentration of the variance in the first $k=10$ dimensions (of the PCA), no separation between different traces.
  • Moreover, the comparison between the magnitudes of $r_M$ and $r_{\text{other}}$ is confusing. Step 2 in the proof of Theorem 4.2 states that the magnitude of the error term is much larger than that of $r_M$. However, $r_M$ contains 70% of the eigenvalues, hence it has a larger norm!
  • Theorems 4.1 and 4.2 are based on empirical evidence; hence, they cannot be stated as formal theorems. When framed as formal theorems, all assumptions have to be stated inside the theorem.
  • Step 1 in the proof of Theorem 4.1 shows equality between an expectation and an empirical estimation of the covariance matrix.
  • Eq. (4) is missing the cross terms $r_{\text{overthinking}}^T h^{l}(x_i) r_{\text{other}}$ and $r_{\text{other}}^T h^{l}(x_i) r_{\text{overthinking}}$
  • The covariance matrix in Eq. (5) is missing the average on the LHS: $\frac{1}{N-1}(A^{l}-\overline{A}^{l}) (A^{l}-\overline{A}^{l})^T$

Apart from the formal flaws, the paper only focuses on DeepSeek-R1-based reasoning traces. It is not given that the finding will also hold for other families of LRMs.

Questions

  • I would appreciate it if the formal flaws could be addressed in a revised version.
  • Moreover, experiments on another family of reasoning models (e.g., Qwen 3) might be interesting.
  • Line 150 states that the intervention is applied across all layers. However, Lines 264-265 mention only single layers to be adjusted. Please clarify.

Limitations

Limitations are provided in the paper (Section 6).

Final Justification

After the rebuttal clarified some of my concerns regarding fundamental steps in the method, I have updated my score to a weak accept. The paper presents an effective method supported by theoretical insights and practical experiments. As noted by other reviewers, remaining weaknesses include the approach to hyperparameter selection and the question of generalizability to non-mathematical design tasks.

Formatting Issues

N/A

Author Response

We greatly appreciate your recognition of the importance of addressing overthinking in LRMs, as well as of our method's success in mitigating it in R1-based models. Below, we address your concerns in Weaknesses (W), Questions (Q) and Flaws (F). We sincerely hope that you find our response satisfactory.


W1&Q1-F1. Paragraph "Linear Low-Dimensional Manifold Verification" states that the steering direction is composed of an overthinking ($r_{\text{overthinking}}$) and an error ($r_{\text{other}}$) term by performing a PCA on joint redundant and concise reasoning traces. However, I don't see how a PCA shows orthogonality between the two by doing a joint PCA. The PCA just shows the concentration of the variance in the first $k=10$ dimensions (of the PCA), no separation between different traces.

For the orthogonality of $r_{\text{overthinking}}$ and $r_{\text{other}}$, it is a property arising from the principles of PCA. For a centered data matrix $A^{(l)} \in \mathbb{R}^{d \times N}$, where $d$ is the ambient dimension (i.e., the hidden size of activations) and $N$ is the number of activation vectors, the covariance matrix is $C^{(l)} = \frac{1}{N-1} (A^{(l)} - \bar{A}^{(l)}) (A^{(l)} - \bar{A}^{(l)})^\top$. Its eigenvalue decomposition is $C^{(l)} = U^{(l)} \Lambda^{(l)} (U^{(l)})^\top$, where $U^{(l)} = [u_1^{(l)}, \dots, u_d^{(l)}]$ is orthogonal ($(U^{(l)})^\top U^{(l)} = I$), and $\Lambda^{(l)} = \mathrm{diag}(\lambda_1^{(l)}, \dots, \lambda_d^{(l)})$ with $\lambda_1^{(l)} \geq \dots \geq \lambda_d^{(l)} \geq 0$. The low-dimensional subspace $\mathcal{M}$ is spanned by $U_{\text{eff}}^{(l)} = [u_1^{(l)}, \dots, u_k^{(l)}]$, the top $k$ eigenvectors, satisfying $(U_{\text{eff}}^{(l)})^\top U_{\text{eff}}^{(l)} = I_k$. The projection matrix is $P_{\mathcal{M}} = U_{\text{eff}}^{(l)} (U_{\text{eff}}^{(l)})^\top$. The overthinking component is $r_{\text{overthinking}} = P_{\mathcal{M}} r^{(l^\star)}$, and the error term is $r_{\text{other}} = r^{(l^\star)} - r_{\text{overthinking}} = (I - P_{\mathcal{M}}) r^{(l^\star)}$. Their inner product is $\langle r_{\text{overthinking}}, r_{\text{other}} \rangle = (r^{(l^\star)})^\top P_{\mathcal{M}}^\top (I - P_{\mathcal{M}}) r^{(l^\star)}$. Since $P_{\mathcal{M}}$ is symmetric ($P_{\mathcal{M}}^\top = P_{\mathcal{M}}$) and idempotent ($P_{\mathcal{M}}^2 = P_{\mathcal{M}}$), we have $P_{\mathcal{M}} (I - P_{\mathcal{M}}) = P_{\mathcal{M}} - P_{\mathcal{M}}^2 = 0$. Thus, $\langle r_{\text{overthinking}}, r_{\text{other}} \rangle = 0$, confirming their orthogonality as a consequence of PCA's orthogonal projection.
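This argument is easy to verify numerically; the following sketch (arbitrary dimensions, random subspace) checks that the projector is symmetric and idempotent and that the two components are orthogonal:

```python
import numpy as np

d, k = 64, 10
rng = np.random.default_rng(0)

# Orthonormal basis U_eff for the k-dimensional subspace M (random here).
U_eff, _ = np.linalg.qr(rng.standard_normal((d, k)))
P = U_eff @ U_eff.T                           # projection matrix P_M

r = rng.standard_normal(d)                    # steering direction r^(l*)
r_overthinking = P @ r                        # component inside M
r_other = r - r_overthinking                  # residual (I - P) r

assert np.allclose(P @ P, P)                  # idempotent: P^2 = P
assert np.allclose(P, P.T)                    # symmetric
assert abs(r_overthinking @ r_other) < 1e-10  # orthogonality holds
```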

Finally, we appreciate your valuable feedback, and we will incorporate a clearer explanation in the revised version.


W1&Q1-F2. Moreover, the comparison between the magnitudes of $r_M$ and $r_{\text{other}}$ are confusing. Step 2 in the proof of Theorem 4.2 states that the magnitude of the error term is much larger than the $r_M$. However, $r_M$ contains 70% of the eigenvalues, hence it has a larger norm!

We are sorry for this mistake and the resulting confusion. The direct comparison in Step 2 of the proof of Theorem 4.2, stating that $d \gg k$ and $\|r_{\text{other}}\|_2^2 \gg \|r_M\|_2^2$, leading to $\|r^{(l^\star)}\|_2 \approx \|r_{\text{other}}\|_2$, may not always hold, particularly given that the subspace associated with $r_M$ captures 70% of the eigenvalues (variance), which could result in a larger norm for $r_M$ under strong alignment with activation differences.

Fortunately, this does not impact the core conclusion of Theorem 4.2, which establishes that the mean activation shift $\Delta \mu^{(l)}$, induced by the intervention along the overthinking direction $r^{(l^\star)}$, has an unexpected norm component proportional to $\alpha \|r_{\text{other}}\|_2$. This shift undergoes layer-wise amplification through the effects of attention mechanisms, non-linear activations, and weight matrices, leading to disruption of the model's normal abilities.

For a more rigorous description, we will revise Step 2 to remove the direct magnitude comparison and instead clarify the actual noise contribution from $r_{\text{other}}$ and how it drives a shift component that propagates and amplifies across layers.


W1&Q1-F3. Theorems 4.1 and 4.2 are based on empirical evidence; hence, they cannot be stated as formal theorems. When being framed as a formal theorem, all assumptions have to be stated inside the theorem.

We appreciate the reviewer's valuable feedback. However, Theorems 4.1 and 4.2 are indeed formal theorems, not empirical findings; empirical evidence is used solely to further validate the correctness of these analytical results.

Theorem 4.1 derives the analytical expression for the expected noise norm of the intervention component, based on algebraic properties of covariance and projection matrices. Theorem 4.2 similarly computes the mean activation shift $\Delta \mu^{(l)}$ and its layer-wise amplification in transformers, derived from mathematical properties of attention mechanisms, GeLU non-linearities, and residual connections, rather than from empirical evidence.

The reviewer's concern may stem from less precise phrasing elsewhere, such as the case in W1&Q1-F4. To avoid any misunderstanding and enhance readability, we will revise Theorems 4.1 and 4.2 to explicitly include the assumptions and premises (e.g., the low-dimensional manifold, transformer architecture specifics) in their statements. Thanks again for your reminder!


W1&Q1-F4. Step 1 in the proof of Theorem 4.1 shows equality between an expectation and an empirical estimation of the covariance matrix.

We apologize for the imprecise expression. The correct statement should clarify that the empirical covariance estimate converges to the theoretical expectation as the datasets $D_{\text{redundant}}$ and $D_{\text{concise}}$ grow sufficiently large and representative, i.e., $\mathbb{E}[r^{(l^\star)} r^{(l^\star)\top}] \approx \frac{C^{(l)}}{|D_{\text{redundant}}|} + \frac{C^{(l)}}{|D_{\text{concise}}|}$ in the limit of large sample sizes. We will revise the manuscript to include a clearer and more rigorous description.


W1&Q1-F5. Eq. (4) is missing the cross terms.

As stated in our response to W1&Q1-F1, $r_{\text{overthinking}}$ and $r_{\text{other}}$ are orthogonal due to PCA's orthogonal projection, satisfying $\langle r_{\text{overthinking}}, r_{\text{other}} \rangle = r_{\text{overthinking}}^T r_{\text{other}} = 0$. As a result, the cross terms vanish, allowing us to eliminate them in Eq. (4).


W1&Q1-F6. The covariance matrix in Eq. (5) is missing the average on the LHS: $\frac{1}{N-1}(A-\overline{A})(A-\overline{A})^T$.

We apologize for omitting the average on the LHS and will correct it immediately in the revised version. Thanks again for your reminder!


W2&Q2. Apart from the formal flaws, the paper only focuses on DeepSeek-R1-based reasoning traces.

Thanks for your suggestion. To address this concern, we have extended our evaluation to include three additional models: Skywork-OR1-7B, Qwen3-8B, and QwQ-32B. Here, we also compare our manifold steering against the baseline SEAL method on GSM8K and MATH500. The results are as follows; our approach consistently outperforms SEAL across all tested models, exhibiting superior performance and robust model generalizability.

| Model | Method | GSM8K Pass@1 (↑, %) | GSM8K #Tokens (↓) | MATH500 Pass@1 (↑, %) | MATH500 #Tokens (↓) |
|---|---|---|---|---|---|
| Skywork-OR1-7B | Vanilla | 93.2 | 2377 | 95.8 | 5033 |
| Skywork-OR1-7B | SEAL | 93.4 (+0.2) | 1639 (-31%) | 95.6 (-0.2) | 3775 (-25%) |
| Skywork-OR1-7B | Ours | 93.4 (+0.2) | 951 (-60%) | 96.2 (+0.4) | 2986 (-41%) |
| Qwen3-8B | Vanilla | 97.7 | 1583 | 96.2 | 5280 |
| Qwen3-8B | SEAL | 97.9 (+0.2) | 1123 (-29%) | 95.8 (-0.4) | 4384 (-17%) |
| Qwen3-8B | Ours | 98.0 (+0.3) | 612 (-61%) | 96.4 (+0.2) | 3273 (-38%) |
| QwQ-32B | Vanilla | 96.7 | 2040 | 95.4 | 4447 |
| QwQ-32B | SEAL | 96.8 (+0.1) | 1714 (-16%) | 94.8 (-0.6) | 3602 (-19%) |
| QwQ-32B | Ours | 97.1 (+0.4) | 1265 (-38%) | 95.6 (+0.2) | 2668 (-40%) |

Q3. Line 150 states that the intervention is applied across all layers. However, Lines 264-265 mention only single layers to be adjusted. Please clarify.

Actually, in Lines 264-265, the mentioned layer refers to the layer used to compute the steering direction, not the layer where steering is applied. During the inference stage, steering is indeed applied to all model layers, consistent with the approach described in Line 150. We will further clarify this in the revised version. Thanks for your reminder!
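A minimal sketch of this clarified setup in PyTorch, assuming a HuggingFace-style causal LM whose decoder blocks are exposed as `model.model.layers` and a precomputed `r_manifold` vector; the module path and the sign convention (subtracting the direction to mitigate overthinking) are illustrative assumptions, not the authors' released code:

```python
import torch

def make_steering_hook(r_manifold: torch.Tensor, alpha: float):
    # Shift the hidden states emitted by a decoder layer by -alpha * r.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden - alpha * r_manifold.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# r_manifold is computed once from a single layer (e.g., the second-to-last),
# but at inference it is applied to every layer and, via the forward pass,
# at every decoding step.
handles = [
    layer.register_forward_hook(make_steering_hook(r_manifold, alpha=1.0))
    for layer in model.model.layers
]
# ... model.generate(...) runs with steering active ...
for h in handles:
    h.remove()
```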

Comment

Thank you for your elaborations. The step of projecting $r^{(l^\star)}$ onto the top-$k$ principal components was not clear before. Your explanation clarified it! I will raise my score.

Comment

We once again appreciate your time reviewing our work and rebuttal, and we’re glad to see that our explanation has addressed your concerns! We will include the above explanation in the revision properly.

Review (Rating: 2)

This paper presents Manifold Steering, an Inference Time Intervention (ITI) strategy to mitigate overthinking in Large Reasoning Models (LRMs). The authors first sample "redundant" and "concise" responses to derive a steering vector associated with the overthinking direction. Given that, the authors perform ITI on LRMs at various intensities, and observe that the steering vector contains noisy components that interfere with reasoning. Through PCA on the activations, they reveal that these disruptive components reside in a subspace spanned by residuals. By removing the redundancy via a simple linear transformation, Manifold Steering (MS) achieves more effective and robust overthinking mitigation than simple ITI.

The authors conduct comprehensive experiments to demonstrate that MS:

(1) consistently reduces generation tokens while maintaining reasoning performance, and

(2) demonstrates strong cross-domain generalization.

Strengths and Weaknesses

Strengths:

  1. The discovery of the "low-dimensional manifold" within a steering vector is interesting. The interpretability tool introduced to identify such a subspace is both intuitive and easy to grasp. It resonates with me, as residual components that preserve core language abilities can indeed hinder precise reasoning. Overall, I find these operations and insights valuable.
  2. The experiments are fairly comprehensive and effectively address concerns about the robustness of the ITI method in cross-domain transferability. The experimental results also confirm the method’s effectiveness in token reduction.
  3. This paper is mostly well-written and easy to follow.

Weaknesses

  1. The font size used in the figures is too small to read.

  2. Decomposing activations or parameters and performing steering using selected low-dimensional components has been extensively studied in prior works, for example [1, 2]. However, the discussion of these related studies is insufficient.

  3. See Questions.

[1] Spectral Filters, Dark Signals, and Attention Sinks

[2] Effectively Steer LLM To Follow Preference via Building Confident Directions

Questions

Q1: The Implementation Details section (Lines 264–265) specifies the layers used to implement Manifold Steering. However, it lacks an explanation for why these particular layers were chosen or a discussion on the sensitivity of layer selection. Could you provide some insight or illustration to justify your choice?

Q2: You manually select the values of $\alpha$ for different models based on test performance and generation length. This is a form of data leakage and gives your method an unfair advantage over the baselines. I remain unconvinced of the experimental fairness unless a method for selecting $\alpha$ that is independent of test accuracy is proposed.

Q3: Could the underlying mechanism by which your method reduces overthinking be the simple suppression of transitional tokens such as "wait" or "alternatively"? Figure 4 shows that as α decreases, the token "wait" appears more frequently. [1] noted that such tokens tend to carry high entropy. They may share significant features on a low-dimensional manifold—which may coincide with the one your method identifies. Or, could you report the performance of a model that simply removes "wait" and "alternatively" from the vocabulary? Would it be comparable to your method?

Update: Today I came across a new paper [2], which explicitly removes "wait" from the vocabulary, leading to shorter generations and improved performance. I’m not suggesting that your work should be compared with this paper, especially since your submission predates it. Rather, I mention it to support my hypothesis raised in Q3.

[1] Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

[2] Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency

Limitations

yes

Final Justification

After rebuttal, I continue to have concerns about this paper. The selection of hyperparameters for the paper’s core method is based on a preprint (titled SEAL) that has not undergone peer review. Specifically, the authors determined the optimal α by evaluating performance on Math-500, yet they include Math-500 results in their main experiments as evidence of the method’s effectiveness—this is test-data leakage.

Additionally, since the paper focuses on mathematical tasks, hyperparameter tuning on Math-500 (an in-domain dataset) benefits other math tasks.

I raised these issues, but the authors defended their hyperparameter selection by citing SEAL, which failed to address my concerns, because I find SEAL's hyper-parameter selection is also problematic. As a result, I kept my rating to reject this paper.

I found that while the other reviewers have consistently given high ratings to this paper, none have raised the issue of test-data leakage. I kindly request that the ACs examine this concern and take it into account as part of a comprehensive evaluation. Thank you.

Formatting Issues

The submission violates Section 3 of the Formatting Instructions: all headings should be lower case (except for the first word and proper nouns). However, in this paper, all words in the headings are capitalized.

Also, according to Section 4.4 in Formatting Instructions: "publication-quality tables do not contain vertical rules." The tables in the submission include vertical rules.

Author Response

We greatly appreciate your recognition of our innovative discovery of the "low-dimensional manifold" within a steer vector, fairly comprehensive experiments, as well as the paper’s clarity. Below, we address your concerns in Weaknesses (W) and Questions (Q). We sincerely hope that you can find our response satisfactory.


W1. The font size used in the figures is too small to read

Thanks for your suggestion. We will increase the font size of figures in the revised version to ensure clarity and better readability.


W2. Decomposing activations or parameters and performing steering using selected low-dimensional components has been extensively studied in prior works, for example [1, 2]. However, the discussion of these related studies is insufficient.

Prior work, such as [1], uses spectral filters based on singular value decomposition (SVD) of embedding and unembedding matrices to partition the residual stream spectrum in transformer-based LLMs. While their method suppresses noise, they mainly focus on explaining attention sinks through "dark signals" in tail-end subspaces. In contrast, we specifically address intrinsic mitigation of overthinking in reasoning models. In addition, we also provide a theoretical analysis on the expected noise norm in high-dimensional spaces (Theorem 4.1).

For [2], it innovatively utilizes confident directions derived from user history activations and filtered through probabilistic classification with logistic regression for noise reduction, emphasizing robustness in multi-preference alignment scenarios. Different from it, our approach mainly relies on the low-dimensional manifold projection to eliminate interference noise, targeting intrinsic activation patterns for reasoning efficiency rather than behavioral preferences, thereby enabling precise interventions across diverse LRMs.

Lastly, thanks for your suggestion and we will add detailed discussion with these related studies to the revised version.


Q1. The Implementation Details section (Lines 264–265) specifies the layers used to implement Manifold Steering. However, it lacks an explanation for why these particular layers were chosen or a discussion on the sensitivity of layer selection. Could you provide some insight or illustration to justify your choice?

We are sorry for the lack of implementation details on layer selection. In this work, we follow the prior work SEAL and conduct a hyper-parameter search on the validation set rather than the test datasets to identify the optimal layer; we apologize for not including this detail in the manuscript and will add the full search results in the revised version. The results show that the second-to-last layer of the model typically yields the best steering direction (a schematic sketch of the search follows the references below). This is reasonable, as representations in later layers tend to become more linear due to the cumulative effects of residual connections and low-norm block contributions (see also [1, 2]), which facilitates stable and interpretable steering while preserving high-level semantic information.

[1] Razzhigaev et al. Your Transformer is Secretly Linear

[2] Hernandez et al. Linearity of Relation Decoding in Transformer Language Models
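For concreteness, the validation-based layer search can be organized as below; `compute_steering_direction` and `eval_accuracy_and_tokens` are hypothetical placeholders for the direction-extraction and evaluation harness, and the selection rule is an assumed accuracy-preserving criterion rather than the authors' exact procedure:

```python
def search_steering_layer(model, candidate_layers, val_set, alpha):
    """Pick the layer whose steering direction best trades accuracy for tokens."""
    best = None
    for layer_idx in candidate_layers:
        # Hypothetical helpers: extract the difference-in-means direction at
        # this layer, project it onto the PCA manifold, then score on the
        # validation split (never the test set).
        r = compute_steering_direction(model, layer_idx)
        acc, tokens = eval_accuracy_and_tokens(model, r, alpha, val_set)
        # Keep the candidate with preserved accuracy and the fewest tokens.
        if best is None or (acc >= best["acc"] and tokens < best["tokens"]):
            best = {"layer": layer_idx, "acc": acc, "tokens": tokens, "r": r}
    return best
```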


Q2. You manually select the values of $\alpha$ for different models based on test performance and generation length. This is a form of data leakage and gives your method an unfair advantage over the baselines. I remain unconvinced by the experimental fairness unless a method for selecting $\alpha$ that is independent of test accuracy is proposed.

Actually, our procedure for selecting $\alpha$ follows the baseline method SEAL, which selects $\alpha$ on MATH500; $\alpha$ is not tuned specifically for other test datasets. As shown in Tab. 1 and Fig. 3, our method demonstrates strong generalization across different datasets with the same $\alpha$ value.

To further improve clarity, we will include a specific note in the revised version to highlight that $\alpha$ is selected solely based on MATH500, reinforcing that our approach is independent of test accuracy on other datasets.


Q3 & Update. Could the underlying mechanism by which your method reduces overthinking be the simple suppression of transitional tokens such as "wait" or "alternatively"? Figure 4 shows that as α decreases, the token "wait" appears more frequently. [1] noted that such tokens tend to carry high entropy. They may share significant features on a low-dimensional manifold—which may coincide with the one your method identifies. Or, could you report the performance of a model that simply removes "wait" and "alternatively" from the vocabulary? Would it be comparable to your method?

Thanks for your suggestion. First, we want to clarify that Figure 4 illustrates the bidirectional effect of the $\alpha$ parameter: positive $\alpha$ values drive our method to mitigate overthinking, producing concise outputs, while negative $\alpha$ values enhance overthinking, leading to verbose generations. Your observation that "as $\alpha$ decreases, the token 'wait' appears more frequently" aligns with our demonstration of amplified overthinking in the negative $\alpha$ regime. However, when $\alpha$ becomes positive, in the mitigation direction, we observe a significant reduction in high-entropy transitional tokens like "wait" and "alternatively," consistent with [1]'s findings, effectively reducing unnecessary deliberation.

As you suggest, we also tested the performance of removing "wait" and "alternatively" from the vocabulary on R1-7B with GSM8K. This approach reduces tokens by 31%, which is less effective than our method's 62% reduction. This may be because suppressing specific tokens shifts probability to other redundant words, failing to address all overthinking patterns. In contrast, our method more fundamentally captures the essential overthinking modes.
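For reference, this token-removal ablation can be approximated with the `bad_words_ids` argument of HuggingFace `generate`; the checkpoint name and prompt below are illustrative placeholders, not the exact setup used:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed R1-7B checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Ban transitional tokens (with common casing/spacing variants).
banned = ["wait", "Wait", " wait", "alternatively", "Alternatively", " alternatively"]
bad_words_ids = [tok(w, add_special_tokens=False).input_ids for w in banned]

prompt = "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total?"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=1024, bad_words_ids=bad_words_ids)
print(tok.decode(out[0], skip_special_tokens=True))
```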

Comment

Thank you for your response. However, I have the following clarifications to raise:

Q1. Sorry, I remain unconvinced by experimental results that are neither included in the PDF manuscript nor accompanied by specific numerical values in the rebuttal. Concrete and documented data is necessary to support the claims.

Q2: Relying on SEAL (a method that has not yet passed peer review) for selecting α does not validate the reasonableness or fairness of your experiments. While tuning hyperparameters on Math500 is acceptable, using Math500 itself as a benchmark to demonstrate your method’s effectiveness is problematic. SEAL also highlights improvements on Math500, which is similarly problematic. Additionally, since your other test datasets are math-related, in-domain hyperparameter tuning on Math500 may benefit these tasks. If α was indeed tuned on Math500, it is critical to provide out-of-domain results on non-mathematical tasks to verify generalizability—and Math500 should not be treated as evidence of effectiveness.

Q3: Your explanation makes sense.

Overall, I maintain my scores due to problematic hyperparameter searching and results presentation.

Comment

We sincerely appreciate your feedback and suggestions. To further address your concerns, we provide the following clarifications and hope you find our response satisfactory.


Q1. Sorry, I remain unconvinced by experimental results that are neither included in the PDF manuscript nor accompanied by specific numerical values in the rebuttal. Concrete and documented data is necessary to support the claims.

We are sorry for not including the specific results in the original manuscript. Due to space constraints during the rebuttal stage, we were unable to include all such results in our response and committed to including them in the revision.

We appreciate this extra chance and space to present the specific results below. As shown in the results, to ensure accuracy is maintained, our currently selected layers, R1-1.5B (layer 27), R1-7B (layer 27), R1-8B (layer 31), and R1-14B (layer 47), are indeed the optimal choices.

For R1-1.5B,

| | Vanilla | Layer 1 | Layer 5 | Layer 10 | Layer 15 | Layer 20 | Layer 25 | Layer 27 |
|---|---|---|---|---|---|---|---|---|
| Accuracy (%) | 76.4 | 76.6 | 77.0 | 76.4 | 57.6 | 74.8 | 67.4 | 78.6 |
| #Tokens | 4762 | 4472 | 4434 | 4223 | 1469 | 3930 | 1179 | 3458 |

For R1-7B,

| | Vanilla | Layer 1 | Layer 5 | Layer 10 | Layer 15 | Layer 20 | Layer 25 | Layer 27 |
|---|---|---|---|---|---|---|---|---|
| Accuracy (%) | 88.2 | 88.4 | 88.0 | 88.2 | 84.4 | 80.6 | 72.2 | 88.4 |
| #Tokens | 3824 | 3685 | 3665 | 3701 | 2713 | 1906 | 1070 | 2239 |

For R1-8B,

| | Vanilla | Layer 1 | Layer 5 | Layer 10 | Layer 15 | Layer 20 | Layer 25 | Layer 30 | Layer 31 |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy (%) | 87.8 | 87.2 | 87.8 | 88.2 | 75.6 | 86.4 | 71.8 | 87.6 | 88.0 |
| #Tokens | 4009 | 3896 | 3820 | 3654 | 2950 | 3280 | 1856 | 2975 | 2873 |

For R1-14B,

| | Vanilla | Layer 1 | Layer 5 | Layer 10 | Layer 15 | Layer 20 | Layer 25 | Layer 30 | Layer 35 | Layer 40 | Layer 45 | Layer 47 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy (%) | 92.8 | 92.4 | 92.4 | 92.0 | 92.2 | 92.6 | 89.8 | 80.4 | 87.4 | 84.6 | 82.4 | 92.8 |
| #Tokens | 3496 | 3384 | 3420 | 3095 | 2958 | 2857 | 2398 | 1814 | 2207 | 1836 | 1625 | 2074 |

Q2. Relying on SEAL (a method that has not yet passed peer review) for selecting α does not validate the reasonableness or fairness of your experiments. While tuning hyperparameters on Math500 is acceptable, using Math500 itself as a benchmark to demonstrate your method’s effectiveness is problematic. SEAL also highlights improvements on Math500, which is similarly problematic. Additionally, since your other test datasets are math-related, in-domain hyperparameter tuning on Math500 may benefit these tasks. If α was indeed tuned on Math500, it is critical to provide out-of-domain results on non-mathematical tasks to verify generalizability—and Math500 should not be treated as evidence of effectiveness.

Thanks for your suggestion! Actually, as shown in Figure 3 of the manuscript, we have tested on out-of-domain datasets, including LiveCodeBench and GPQA, demonstrating consistently strong performance. To further verify our method's superiority, we also provide comparative results against the baseline SEAL as follows:

| Model | Method | LiveCodeBench Accuracy (↑, %) | LiveCodeBench #Tokens (↓) | GPQA-Diamond Accuracy (↑, %) | GPQA-Diamond #Tokens (↓) |
|---|---|---|---|---|---|
| R1-1.5B | Vanilla | 29.8 | 7931 | 30.5 | 10340 |
| R1-1.5B | SEAL | 29.8 (+0.0) | 7653 (-4%) | 31.3 (+0.8) | 9845 (-5%) |
| R1-1.5B | Ours | 30.3 (+0.5) | 6983 (-12%) | 31.8 (+1.3) | 9003 (-13%) |
| R1-7B | Vanilla | 46.5 | 6717 | 61.5 | 8323 |
| R1-7B | SEAL | 47.3 (+0.8) | 6248 (-7%) | 60.6 (-0.9) | 7758 (-7%) |
| R1-7B | Ours | 47.5 (+1.0) | 4881 (-27%) | 65.0 (+3.5) | 7067 (-15%) |
| R1-8B | Vanilla | 47.0 | 7963 | 60.0 | 8480 |
| R1-8B | SEAL | 46.0 (-1.0) | 7325 (-8%) | 59.6 (-0.4) | 7876 (-7%) |
| R1-8B | Ours | 49.0 (+2.0) | 6948 (-13%) | 60.0 (+0.0) | 7419 (-13%) |
| R1-14B | Vanilla | 55.6 | 5961 | 76.8 | 7262 |
| R1-14B | SEAL | 56.8 (+0.2) | 5564 (-7%) | 77.3 (+0.5) | 6815 (-6%) |
| R1-14B | Ours | 57.6 (+2.0) | 4814 (-19%) | 77.3 (+0.5) | 6019 (-17%) |
Review (Rating: 5)

This paper introduces Manifold Steering, a novel method to mitigate "overthinking" in Large Reasoning Models (LRMs). The authors begin by identifying a single "overthinking direction" in the model's activation space but find that steering along it introduces interference noise that limits its effectiveness. They further discover that the overthinking-related activations reside on a low-dimensional manifold. Manifold Steering leverages this by projecting the initial steering vector onto this manifold to remove the noise. This technique leads to token reductions of up to 71% on math benchmarks, maintains or improves accuracy, and demonstrates strong transferability to other domains.

Strengths and Weaknesses

Strengths

  • The paper's proposed approach efficiently addresses LRM overthinking from the perspective of mechanistic interpretability, offering a new perspective on the "overthinking" problem.
  • The proposed method achieves impressive experiment results, consistently reducing token counts while maintaining or even improving accuracy across evaluated math benchmarks.
  • The paper is well-written and logically organized, presenting a sound derivation and theoretical justification for the proposed method.

Weaknesses

  • The evaluation is limited to the DeepSeek-R1-Distilled model family, which raises questions regarding the generalizability of the proposed method.
  • The method intervention strength is sensitive to real-time task complexity, which might limit the practicality of the method.

Questions

  • Do you plan to test this method on non-DeepSeek-distilled models?
  • In section 3.2, the steering is applied to all model layers. However, in section 5.1, model steering is only applied to a single specific layer, why? How do you choose this layer, does it involve expensive hyper-parameter search?
  • Table 1 shows that Manifold Steering can sometimes improve the accuracy on challenging math benchmarks like AIME24, could you share some insights behind this phenomenon?

Limitations

yes

Final Justification

The authors provide additional strong results for their proposed method on some of the latest open-source models, and also provide clear clarifications of their method. There are no remaining issues, and I keep my score at 5.

Formatting Issues

no

Author Response

We greatly appreciate your recognition of the importance of mitigating overthinking, as well as your positive feedback on our manifold steering's simplicity, effectiveness, and broad applicability. Below, we address your concerns in Questions (Q). We sincerely hope that you can find our response satisfactory.


Q1. In (2), does $x$ refer to "prompt + response" or only "prompt"? Also, is (3) applied to each decoding step or only at the first decoding step?

In Eq. (2), $x$ refers to "prompt + response", and Eq. (3) is applied at each decoding step.


Q2. In (5), why $C = \frac{1}{N-1} A (A - \bar{A})^T$ instead of $C = \frac{1}{N-1} (A - \bar{A})(A - \bar{A})^T$, as the latter one is the definition of the covariance matrix?

Thank you for pointing this out. We are sorry for this mistake and will revise it immediately. The expression in Equation (5) should indeed be $C = \frac{1}{N-1}(A - \bar{A})(A - \bar{A})^T$.


Q3. In Theorem 4.1, it states $C^{(\ell)}$ is for the redundant dataset, but in line 195 it says $A^{(\ell)}$ is calculated using all reasoning data. Which one is correct?

We sincerely thank the reviewer for pointing out this mistake. The correct statement, as described in line 195, is that $A^{(\ell)}$ is calculated using all reasoning data. We will revise Theorem 4.1 accordingly. Thanks again for your reminder!


Q4. In Line 240, $P = UU^T$ where $U \in \mathbb{R}^{d \times k}$ and $k \ll d$, then how could $P = I$ (it cannot be full-rank)?

We sincerely appreciate the reviewer for identifying the issue with the statement $P_M = I$ in Line 240. We acknowledge that stating $P = I$ is a mistake, as $P = UU^T$ with $U \in \mathbb{R}^{d \times k}$ and $k \ll d$ cannot be full-rank.

Our intended explanation is that when $r^{(l^\star)}$ is transformed by the projection matrix to become $r^{(l^\star)}_{\text{new}} = P_M r^{(l^\star)} = r^{(l^\star)}_{\text{overthinking}}$, the new vector satisfies $r^{(l^\star)}_{\text{new}} = P^{\text{new}}_M r^{(l^\star)}_{\text{new}}$, i.e., $(I - P^{\text{new}}_M) r^{(l^\star)}_{\text{new}} = 0$, since $r^{(l^\star)}_{\text{new}}$ lies within the subspace $M$. However, we erroneously implied that $I - P^{\text{new}}_M = 0$. In fact, $\mathbb{E}[\|r_{\text{other}}\|_2^2] = \mathrm{tr}((I - P^{\text{new}}_M)\, \Sigma^{(l)}_{\text{noise}}) = 0$ holds because $\Sigma^{(l)}_{\text{noise}}$ is primarily supported in $M$, resulting in zero components in $M^{\perp}$. We will revise the statement in the revised version. Thanks again for your reminder!
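This corrected claim can also be sanity-checked numerically: construct a noise covariance supported inside $M$ and verify the trace term vanishes (dimensions arbitrary, a sketch for intuition only):

```python
import numpy as np

d, k = 64, 10
rng = np.random.default_rng(1)

U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal basis of M
P = U @ U.T                                       # projector onto M

# Noise covariance supported entirely inside M: Sigma = U S U^T.
S = np.diag(rng.uniform(0.5, 2.0, size=k))
Sigma_noise = U @ S @ U.T

# Expected squared norm of the residual component is tr((I - P) Sigma).
residual_energy = np.trace((np.eye(d) - P) @ Sigma_noise)
print(residual_energy)  # ~0 up to floating-point error
```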

Comment

Dear Reviewer 9Mtj,

We sincerely apologize for an ordering error in our rebuttal. While carefully proofreading each response to ensure accuracy, we inadvertently arranged our responses in character order rather than the original review sequence. This means the responses to you and Reviewer AXAi have been reversed in order.

We deeply regret this mistake and are truly sorry for any trouble this may cause. If you could forgive us for this oversight, we would be immensely grateful.

With our sincere apologies,

Authors

Comment

Thank you for the clarifications on the method, which has cleared my concerns. I will keep my score as it's already very high.

Comment

We sincerely appreciate your continued positive recognition of our work! We are glad that our rebuttal has cleared your concerns, and will include the above clarifications in the revision properly.

Review (Rating: 5)

The paper proposed a novel method to mitigate overthinking in large reasoning models (LRMs) while maintaining the performance. First, by using the OpenMathInstruct-2 training dataset, the authors obtain a difference vector that represents the direction of the difference between concise and redundant responses. While directly steering the model along this direction too much might even increase the length of the response, the authors utilize a low-dimensional manifold obtained by principal component analysis (PCA) to effectively mitigate overthinking. They conducted extensive experiments across different models and different datasets, corroborating the effectiveness of the manifold steering method. They also showed that the manifold steering method can be effectively transferred to other tasks such as jailbreaking.

Strengths and Weaknesses

Strengths:

  1. The problem of Mitigating Overthinking is important.

  2. The authors propose a simple yet effective method to solve the problem.

  3. The proposed method is applicable to various tasks.

Weaknesses:

There is no obvious weakness of this paper. There might be some minor issues in the theoretical derivation (see the question section).

Questions

  1. In (2), does $x$ refer to “prompt + response” or only “prompt”? Also, is (3) applied to each decoding step or only at the first decoding step?
  2. In (5), why $C = \frac{1}{N-1} A(A-\bar{A})^T$ instead of $C = \frac{1}{N-1} (A-\bar{A})(A-\bar{A})^T$, as the latter one is the definition of the covariance matrix?
  3. In Theorem 4.1, it states $C^{(\ell)}$ is for the redundant dataset, but in line (195) it says $A^{(\ell)}$ is calculated using all reasoning data. Which one is correct?
  4. In Line 240, $P = UU^T$ where $U \in \mathbb{R}^{d \times k}$ and $k \ll d$, then how could $P = I$ (it cannot be full-rank)?

Limitations

N/A

Final Justification

The authors addressed my questions in the response. I think this paper has a good contribution and decided to keep my score.

Formatting Issues

no major formatting issues

Author Response

We greatly appreciate your recognition of our approach’s innovative perspective on LRM overthinking, its impressive experimental results, and the paper’s clear, logical presentation. Below, we address your concerns in Weaknesses (W) and Questions (Q). We sincerely hope that you can find our response satisfactory.


W1&Q1. The evaluation is limited to the DeepSeek-R1-Distilled model family, which raises questions regarding the generalizability of the proposed method.

Thanks for your suggestion. To address this concern, we have extended our evaluation to include three additional models: Skywork-OR1-7B, Qwen3-8B, and QwQ-32B. Here, we also compare our manifold steering against the strongest baseline, SEAL, on GSM8K and MATH500. The results are as follows; our approach consistently outperforms SEAL across all tested models, exhibiting superior performance and robust model generalizability.

| Model | Method | GSM8K Pass@1 (↑, %) | GSM8K #Tokens (↓) | MATH500 Pass@1 (↑, %) | MATH500 #Tokens (↓) |
|---|---|---|---|---|---|
| Skywork-OR1-7B | Vanilla | 93.2 | 2377 | 95.8 | 5033 |
| Skywork-OR1-7B | SEAL | 93.4 (+0.2) | 1639 (-31%) | 95.6 (-0.2) | 3775 (-25%) |
| Skywork-OR1-7B | Ours | 93.4 (+0.2) | 951 (-60%) | 96.2 (+0.4) | 2986 (-41%) |
| Qwen3-8B | Vanilla | 97.7 | 1583 | 96.2 | 5280 |
| Qwen3-8B | SEAL | 97.9 (+0.2) | 1123 (-29%) | 95.8 (-0.4) | 4384 (-17%) |
| Qwen3-8B | Ours | 98.0 (+0.3) | 612 (-61%) | 96.4 (+0.2) | 3273 (-38%) |
| QwQ-32B | Vanilla | 96.7 | 2040 | 95.4 | 4447 |
| QwQ-32B | SEAL | 96.8 (+0.1) | 1714 (-16%) | 94.8 (-0.6) | 3602 (-19%) |
| QwQ-32B | Ours | 97.1 (+0.4) | 1265 (-38%) | 95.6 (+0.2) | 2668 (-40%) |

W2. The method intervention strength is sensitive to real-time task complexity, which might limit the practicality of the method.

Thanks for your comment. We fully understand your concern about its potential impact on practicality. To address it, we want to clarify that the intervention strength, determined through hyperparameter tuning, has exhibited impressive task generalization. As shown in Table 1 and Figure 3, the intervention strength optimized on MATH500 achieves robust performance across diverse mathematical datasets of varying difficulty, including GSM8K, AMC2023, and AIME2024, as well as on code-related and subject-knowledge question-answering tasks, which verifies our method's adaptability and practical utility in real-world scenarios.


Q2. In section 3.2, the steering is applied to all model layers. However, in section 5.1, model steering is only applied to a single specific layer, why? How do you choose this layer, does it involve expensive hyper-parameter search?

We are sorry for the confusion caused by our explanation in the paper. To clarify, in Sec. 5.1, the "specific layer" refers to the layer used to compute the steering direction, not the layer where steering is applied. During the inference stage, steering is indeed applied to all model layers (following [1]), consistent with the approach described in Sec. 3.2.

Regarding the choice of the specific layer for computing the steering direction, we follow the baseline method SEAL and conduct a hyper-parameter search on the validation set rather than the test set to identify the optimal layer, which means the search cost is not very expensive. We apologize for not including this detail in the manuscript; we will add it in the revised version. The results show that the second-to-last layer of the model typically yields the best steering direction.

[1] Arditi et al. Refusal in Language Models Is Mediated by a Single Direction


Q3. Table 1 shows that Manifold Steering can sometimes improve the accuracy on challenging math benchmarks like AIME24, could you share some insights behind this phenomenon?

Thanks for your suggestion. We believe this phenomenon can be attributed to two key features of Manifold Steering. First, by mitigating overthinking, our approach reduces redundant reasoning steps, thereby decreasing the number of tokens required. This allows the model to find correct answers within given token limits. Second, Manifold Steering enhances the clarity of the model’s reasoning process, as shown in Fig. 4. For complex problems, this clearer logical structure enables the model to identify correct solution paths more effectively.

Comment

We sincerely appreciate your recognition of our paper as good work! We are glad that our rebuttal has addressed your questions and will include the above clarifications in the revision properly.

Comment

Dear Reviewer AXAi,

We sincerely apologize for an ordering error in our rebuttal. While carefully proofreading each response to ensure accuracy, we inadvertently arranged our responses in character order rather than the original review sequence. This means the responses to you and Reviewer 9Mtj have been reversed in order.

We deeply regret this mistake and are truly sorry for any trouble this may cause. If you could forgive us for this oversight, we would be immensely grateful.

With our sincere apologies,

Authors

Comment

Thank the authors for addressing my questions. I think this is a good paper and will keep my score.

Final Decision

This paper describes an approach to mitigating the problem of LLMs overthinking in the course of solving reasoning problems. The paper presents an interesting and novel insight that the phenomenon of overthinking can be related to activation along specific directions in the model's activation space. The paper proposes an approach that encourages the model to steer its activations toward a desirable subspace identified during training.

Strengths:

The reviewers were generally accepting of the analysis presented in the paper and appreciated the novel approach to mitigating the overthinking problem

The results in the manuscript show meaningful improvements on four mathematical reasoning benchmarks. During the rebuttal process, the authors presented results on additional non-mathematical datasets that strengthened the claims.

Weaknesses:

Reviewer Z6Zt raised issues regarding the selection of an important hyperparameter which governed the extent of the intervention applied in the manifold steering process. There was a specific concern about this parameter being tuned on one of the mathematical benchmark datasets that was also used for evaluation. The authors attempted to address this concern by testing the system on non-mathematical reasoning benchmarks, where they also demonstrate an improvement.

Three of the four reviewers were favorably disposed to this paper and felt that it represented a promising novel approach to addressing the problem of overthinking in reasoning systems.