PaperHub
Overall: 7.1/10 · Spotlight · 5 reviewers
Ratings: 5, 4, 5, 4, 4 (min 4, max 5, std 0.5)
Confidence: 3.2
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.6
NeurIPS 2025

Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

Keywords
Large Language Models, Long Chain of Thoughts

Reviews and Discussion

Review
Rating: 5

This paper studies the mechanism behind self-reflection and long chain-of-thought (CoT) generation from a mechanistic interpretability perspective, by investigating the activation (representation) space. The authors find a set of important and interpretable activations in the last few layers that are attributed to output length and reflection behavior, and that can induce long-CoT behavior together with a reflection token (e.g., "wait"). The paper then introduces two efficient methods to improve models' reasoning capability: (1) a training-free control technique and (2) a parameter-efficient fine-tuning method.

Strengths and Weaknesses

Strengths:

  • From an interpretability perspective, this paper identifies important "reasoning" activations that are consistently present in long-CoT models. The authors locate the essential activations (in the last few layers) and show that they indeed have a causal effect on reasoning capability. This is an interesting and novel finding, providing insight into the localization of long-chain reasoning capabilities.
  • Building on this understanding, the paper also proposes a training-free method and a PEFT method to elicit this long-chain reasoning capability. Both further utilize the preceding findings in the activation space and help the language model achieve better reasoning performance.
  • The paper is generally well-structured and clear.

Weaknesses:

  • A minor limitation is that there are many PEFT methods in the activation space, such as RepE, RED, and localized fine-tuning (LoFiT). The paper does not compare its method with these related baselines.
  • The training-free method appears to work only when the model has both the "wait" token inserted and the decaying activation applied. Merely amplifying the activations or adding wait tokens does not seem to improve accuracy (it may even decrease), according to Table 1. It seems that the decaying factor of the activation is very important. Is it possible to disentangle those effects or provide further analysis of why this happens?
  • It is hard to distinguish between your tuning-free method and your PEFT method from the table at first glance (they are both called EELo-CoT). Is it possible to stress the difference?

I recommend acceptance despite the minor weaknesses I listed.

Questions

See Weakness.

Limitations

See Weakness.

Final Justification

Thanks for the authors' response. I have no further questions, and I wish the authors the best of luck.

Formatting Concerns

N/A

Author Response

Dear Reviewer, We sincerely appreciate your comments and feedback, which help to improve our paper. We reply to all the concerns raised in the weaknesses and questions below.

Weakness-1: A minor limitation is that there are many PEFT methods in the activation space, like RepE, RED, and localized fine-tuning (LoFiT), etc. The paper doesn't compare their method with those baselines that are related.

Reply to Weakness-1: Thank you for highlighting this important point. We agree that comparing with activation-based parameter-efficient fine-tuning (PEFT) methods such as RepE, RED, and LoFiT would further contextualize the contribution of our method.

As a first step toward this direction, we implemented an activation-based baseline inspired by causal tracing-style full activation vector transfer. Specifically:

1. We extracted MLP activation vectors from several transformer layers of the DeepSeek-R1-Distill-Qwen-7B model for both successful and unsuccessful CoT generations on datasets such as AMC23 and GPQA.

2. We computed layer-wise activation deltas (successful − unsuccessful) and patched them into the Qwen2.5-7B base model using forward hooks during inference, similar to how RepE controls latent representations.

3. This method does not require any fine-tuning and simulates a direct activation-space manipulation baseline.
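
A minimal sketch of this kind of forward-hook patching (PyTorch; `deltas`, `patched_layers`, and `model` are placeholders for the quantities described above, not the exact implementation):

```python
import torch

def make_patch_hook(delta: torch.Tensor, alpha: float = 1.0):
    """Add a precomputed activation delta to an MLP output at inference time."""
    def hook(module, inputs, output):
        # output: (batch, seq_len, hidden); the delta broadcasts over positions
        return output + alpha * delta.to(output.device, output.dtype)
    return hook

# `model` is a HuggingFace causal LM (e.g., Qwen2.5-7B); deltas[l] holds the
# (hidden,)-sized successful-minus-unsuccessful difference for layer l.
handles = [
    model.model.layers[l].mlp.register_forward_hook(make_patch_hook(deltas[l]))
    for l in patched_layers
]
# ... run generation ...
for h in handles:
    h.remove()  # restore the unpatched model
```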

| Model Name | Accuracy | Length (words) | Self-reflection |
| --- | --- | --- | --- |
| Qwen 2.5 7B baseline | 30.30% | 457.34 | 4.04% |
| Qwen 2.5 7B Full Activation Tracing | 33.33% | 592.08 | 73.74% |
| EELo-CoT (Ours) | 35.86% | 585.78 | 68.18% |

Weakness-2: The effect of the training-free method only works when the model has both the “wait” token inserted and the decaying activation added. Merely amplifying the activations/adding wait tokens doesn’t seem to improve the accuracy (they may even decrease) according to Table 1. It seems that the decaying factor of the activation is very important. Is it possible to disentangle those effects or provide further analysis on why this happens?

Reply to Weakness-2: Indeed, adding the “wait” token influences the generation behavior by encouraging the model to enter a self-reflective mode. However, our findings suggest that while the presence of the token can trigger self-reflection behavior, it does not ensure the quality or correctness of that reflection. The precision and effectiveness of self-reflection are largely governed by the activation values of specific neurons we identify and modulate. This distinction is supported by our ablation results: inserting the “wait” token alone increases the self-reflection rate but has limited effect on accuracy. Conversely, when combined with activation amplification, we observe consistent improvements in both reflection and correctness. This indicates that the interaction between token-level cues and neuron-level control is synergistic.

| Model Name | Add Wait Token Only | Add Token + Activation Values |
| --- | --- | --- |
| Qwen 2.5 7B | 33.33% Acc on GPQA | 35.86% Acc on GPQA |
| Qwen 2.5 Math 7B | 32.83% Acc on GPQA | 37.88% Acc on GPQA |

Qualitative comparison (Qwen 2.5 7B base):

Wait token only: "Wait, we're not given the products 1 and 2. Let's go through the steps again... how many carbon atoms are there in product 3? (A) 10 (B) 12 (C) 14 (D) 11. We're still not given the products. Let's go through the steps again... how many carbon atoms are there in product 3? (A) 10 (B) 12 (C) 14 (D) 11. ..." [repeating infinitely]

Wait token + Activation Values: "So, the answer is (A) 10. But why did I get this wrong? Because I made a mistake when I was reading the question! I thought I was supposed to count the total number of carbon atoms in all three products, but I only needed to count the number of carbon atoms in product 3. Let me try again... After re-reading the question, I realize that I should only be counting the number of carbon atoms in product 3. And since the reactions didn't change the number of carbon atoms, the answer is indeed (A) 10. Sorry about that, and I hope this explanation helps!"

Weakness-3: It is hard to distinguish between your tuning-free method and your PEFT method from the table at first glance (they are both called EELo-CoT). Is it possible to stress the difference?

Reply to Weakness-3: We would like to make the following clarifications:

  1. Our motivation for developing the intervention method is to have an easy-to-implement method with low computational cost, where the fixed analytic function only needs to be fitted once, with zero training cost.

  2. To carry over the observations from the intervention method into a trainable method that achieves higher performance, we developed a learnable amplification module, as detailed in Section 4, which requires a higher computational cost but delivers better performance. This design provides a trade-off between efficiency and performance, and users may choose the appropriate version depending on their deployment needs. We will make this distinction clearer in the revised version by renaming the trainable method.

Comment

Dear Reviewer,

Thank you for taking the time to review our submission. I truly appreciate your thoughtful feedback and the effort you put into evaluating our work.

I noticed that the overall score appears to have been changed, and I would be grateful to better understand the reasoning behind this decision. I am committed to improving this work and would deeply value any additional clarification you can provide regarding the key areas where the paper could be strengthened—whether in methodology, evaluation, writing, or positioning within the literature.

I recognize the importance of constructive feedback in the peer-review process, and I’m eager to learn from it so I can address the concerns thoroughly, either in a revised version of this paper or in future research directions.

Thank you again for your time.

Comment

Thanks for the authors' detailed response. I appreciate that the authors timely responded to my questions and conducted the additional experiments. They addressed most of my concerns. I have no further questions.

And I just realized the authors cannot see my response/final justification. I didn't change the score, and I keep my recommendation.

Comment

Thank you for your thoughtful feedback and for acknowledging our additional experiments. We appreciate your time and engagement.

Comment

Dear Reviewer

As the rebuttal period is coming to a close, we would like to sincerely thank you for your valuable feedback and thoughtful suggestions. We truly appreciate the time and effort you’ve taken to engage with our work—your input has been incredibly helpful in guiding us to improve the paper.

Thank you again for your support!

Sincerely, The Authors

Review
Rating: 4

Based on observations of activation patterns during reasoning, this paper proposes two simple methods to improve chain-of-thought (CoT) reasoning in language models: one that is training-free, relying on inference-time intervention, and another that is parameter-efficient, requiring only minimal fine-tuning.

Strengths and Weaknesses

Strengths

  1. The paper presents interesting empirical observations about activation patterns in language models that are associated with longer chain-of-thought (CoT) reasoning and self-reflective behavior.
  2. The proposed intervention methods are both simple and practical—particularly the training-free approach—which enables improvements in reasoning without requiring model fine-tuning or architectural changes.

Weaknesses

Some parts of the paper lack clarity and precision in exposition. Specific concerns are raised in the questions section below.

Questions

  1. Why use the number of digits in the last sentence to determine whether to insert the "wait" token?
  2. Since activations are vectors, how exactly are scalar “activation values” defined in this work? Are they projections onto a direction, an average over selected neurons, or something else?
  3. Is the "amplification factor" in Section 2 equivalent to the constant intervention method used in Table 1? If so, why do the two yield such different results (positive trends in Figures 2 & 3, mostly negative results in Table 1)? If not, how do they differ operationally, and why is it not reported in Section 3 alongside other decoding strategies?
  4. Table 3 is not referenced in the main text. Also, it appears that the average reasoning length does not significantly increase, which seems at odds with the intervention’s stated goal.
  5. Since activations are determined by inputs, have the authors examined whether specific textual patterns naturally induce high reflection activations? This might suggest alternative, more interpretable control signals.

Limitations

Yes.

Final Justification

Since my main concerns were about writing and presentation, and the authors have shown a strong commitment to improving these aspects, I am willing to raise my evaluation. However, the improvement hinges on promised revisions and additional experiments, so the ultimate impact will depend on execution, and I can see how others might disagree.

Formatting Concerns

N.A.

Author Response

Dear Reviewer, We sincerely appreciate your comments and feedback, which help to improve our paper. We reply to all the concerns raised in the weaknesses and questions below.

Question-1: Why use the number of digits in the last sentence to determine whether to insert the "wait" token?

Reply to Question-1: We use the number of digits in the last generated sentence as a heuristic signal for when to insert the "wait" token because digit-heavy outputs often correspond to arithmetic or numerical reasoning steps, where models are known to be particularly error-prone [Lewkowycz et al., 2022; Wang et al., 2022; Rae et al., 2021].

In our observations, errors frequently arise when the model performs arithmetic near the end of reasoning chains—precisely where a self-reflection trigger like “wait” can help.

Thus, we adopt a simple trigger strategy: if the last sentence contains more than a threshold number of digits (e.g., ≥5), we inject a “wait” token before the next sentence, prompting the model to reconsider its previous steps.
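
As an illustration, this trigger amounts to a few lines of code (a sketch of the rule described above; the threshold and the way sentences are obtained are simplified placeholders):

```python
import re

def should_insert_wait(last_sentence: str, digit_threshold: int = 5) -> bool:
    """Heuristic trigger: digit-heavy sentences suggest arithmetic steps,
    where self-reflection is most likely to help."""
    return len(re.findall(r"\d", last_sentence)) >= digit_threshold

# Example: a digit-heavy sentence triggers injecting "Wait," before
# generation of the next sentence continues.
if should_insert_wait("So 1234 * 56 = 69104."):
    prefix_for_next_sentence = " Wait,"
```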

We agree this is just one instantiation of a broader reflection-triggering strategy, used simply to verify our motivation of activation control for eliciting long-CoT ability. The mechanism is modular and can easily be adapted to other signals, such as frequent use of uncertain terms (e.g., "maybe", "I think"), high-entropy or low-confidence predictions (e.g., from logits), or specific vocabulary patterns and syntactic cues. We are proposing a new rule and continue to refine it.

Question-2: Since activations are vectors, how exactly are scalar “activation values” defined in this work? Are they projections onto a direction, an average over selected neurons, or something else?

Reply to Question-2: In the first experiment, we verified that only a small number of activation values is useful. We therefore extracted a small set of activation positions from the activation vectors; the scalar "activation values" are the entries of the activation vector at these positions (i.e., individual neurons), rather than projections onto a direction. During the intervention procedure, we focus only on these extracted positions. We also approximate their activation patterns with an analytic function, and during inference we control the values at these positions using the fitted function.
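
To make this concrete, here is a sketch of the selection step under our reading of this reply (the contrastive top-k criterion is illustrative; k = 40 follows the count mentioned elsewhere in this discussion):

```python
import torch

# pos_acts, neg_acts: (num_samples, hidden) mean last-layer MLP activations
# from long-CoT (positive) and short/failed (negative) generations.
diff = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)   # (hidden,)
k = 40                                               # number of positions kept
positions = torch.topk(diff.abs(), k).indices        # neuron indices to control

# The scalar "activation values" are then simply h[..., positions] for a
# hidden state h; only these coordinates are read and rescaled at inference.
```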

Question-3: Is the “amplification factor” in Section 2 equivalent to the constant intervention method used in Table 1? If so, why do the two yield such different results (positive trends in Figures 2 & 3, mostly negative results in Table 1)? If not, how do they differ operationally, and why is it not reported in Section 3 alongside other decoding strategies?

Reply to Question-3: Section 2 (Figures 2 & 3): The amplification factor study is conducted as a controlled experiment, where we isolate the effect of activation scaling by keeping all other variables fixed (e.g., no wait token insertion, no adaptive scheduling). The goal is to analyze how increasing the amplification factor affects accuracy and self-reflection in isolation. These plots show trends, not benchmark-level performance.

Table 1 (Constant Intervention row): this row evaluates constant amplification on full benchmarks, where the model must solve a wide variety of problems. Unlike the controlled setting in Section 2, Table 1 involves additional factors: a fixed set of 150 activation positions selected via contrastive analysis, and no analytic scheduling (unlike Section 3's EELo-CoT method). As a result, the constant amplification may be misaligned with the reasoning context in some cases, leading to unstable or degraded performance on certain examples.

Section 3 introduces the full EELo-CoT framework, which improves over constant intervention by dynamically modulating activation values based on token position (via f(t)), triggering amplification only after wait tokens, and avoiding excessive or context-insensitive intervention. We treat constant intervention as a baseline rather than a decoding strategy, which is why it is presented in Table 1 but not included as part of Section 3's framework.
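
For intuition, the position-dependent schedule can be sketched as follows (assuming the fitted logarithmic-decay form f(t) = a − b·log(t + c) quoted in the reviews; the constants and window size are illustrative placeholders):

```python
import math

def amplification(t: int, a: float = 2.0, b: float = 0.3,
                  c: float = 1.0, t_max: int = 50) -> float:
    """Amplification applied to the selected neurons at the t-th token
    after a 'wait' trigger; decays logarithmically, then switches off."""
    if t >= t_max:
        return 1.0  # no intervention outside the window
    return max(1.0, a - b * math.log(t + c))
```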

Question-4: Table 3 is not referenced in the main text. Also, it appears that the average reasoning length does not significantly increase, which seems at odds with the intervention’s stated goal.

Reply to Question-4: Thank you for pointing this out. We acknowledge the omission and will explicitly reference Table 3 in the revised version of the main text. This was an oversight during formatting and editing.

Regarding the second part of your question: while it's true that reasoning length (measured by word count) does not increase dramatically under our intervention, our goal is not merely to extend the output length. Instead, we aim to elicit meaningful long chain-of-thought reasoning, which includes self-reflective behavior, logical error correction, and more accurate intermediate step analysis.

To clarify this distinction, we added a comparison table below including a no-finetuning baseline and other methods for reference. We emphasize two key takeaways:

1. Performance improvements: despite not fully fine-tuning, our method brings strong gains over the no-FT baseline and competitive results compared to LoRA.

2. Length & reflection tradeoff: although reasoning length increases moderately, it is accompanied by increased self-reflection behavior and better answer accuracy, demonstrating that the quality and deliberativeness of reasoning improves, not just verbosity.

We appreciate your suggestion and will make both the Table 3 reference and this nuanced explanation clearer in the final revision.

| Method | Math500 (tokens) | AMC23 (tokens) | GPQA (tokens) |
| --- | --- | --- | --- |
| Without FT | 81.40 (556.13) | 64.69 (797.83) | 51.01 (598.23) |
| Full Finetuning | 91.60 (3642.71) | 92.50 (14170.80) | 69.19 (7770.40) |
| LoRA | 91.60 (3952.61) | 85.00 (14827.93) | 66.17 (8508.25) |
| EELo-CoT (Ours) | 90.20 (3754.20) | 88.75 (7077.48) | 70.02 (8593.46) |

Question-5: Since activations are determined by inputs, have the authors examined whether specific textual patterns naturally induce high reflection activations? This might suggest alternative, more interpretable control signals.

Reply to Question-5: Yes, we have examined how specific textual patterns influence activation dynamics, and this motivated our use of the "wait" token as a reflection trigger. As shown in Figure 4, the presence of the “wait” token causes a sharp spike in activation values in certain long-CoT-related neurons, while common tokens like “the” or digit tokens like “1” produce little to no activation in the same positions. This suggests that the model has internally learned to associate “wait” with a mode switch into more reflective or cautious reasoning.

This observation led us to select "wait" as an interpretable, symbolic trigger for activating the long-CoT pathway. Unlike learned neuron triggers, this approach is transparent, easily controllable, and portable across models.

Comment

Dear Reviewer,

Thank you again for your thoughtful feedback and constructive suggestions. We have addressed all your latest comments in detail in our reply. We sincerely hope these clarifications and edits help strengthen the paper. Please let us know if you have any remaining concerns or suggestions—we would deeply appreciate your thoughts.

Best regards, Authors

Comment

Thank you for the detailed follow‑up. The clarifications on the distinction between the amplification‑factor studies and the full EELo‑CoT approach address my earlier concerns about redundancy. I also appreciate that you ran the suggested ablation of adaptive scheduling. The quantitative analysis of token‑level activation patterns strengthens the case for "wait" as a meaningful reflective trigger, and I'm glad to see you are exploring alternative triggers. Finally, unifying terminology and updating Figure 7 for consistency will improve clarity.

Since my main concerns were about writing and presentation, and the authors have shown a strong commitment to improving these aspects, I am willing to raise my evaluation. However, the improvement hinges on promised revisions and additional experiments, so I can see how others might disagree.

Comment

Dear Reviewer

As the rebuttal period is coming to a close, we would like to sincerely thank you for your valuable feedback and thoughtful suggestions. We truly appreciate the time and effort you’ve taken to engage with our work—your input has been incredibly helpful in guiding us to improve the paper.

Thank you again for your support!

Sincerely, The Authors

Comment

Thank you for the rebuttal. It clarifies several points, but I still have some questions and suggestions:

  1. Are the amplification factor experiments in Figures 2 & 3 essentially the same method as the "Constant" and “Forcing + Constant” baselines in Table 1, but evaluated in a smaller controlled setting? If so, I suggest integrating these results into the main results section alongside Table 1, and using Section 2 primarily to describe the neuron selection process. This would reduce redundancy and make the paper structurally easier to follow.

  2. Table 1 should include a baseline that applies your decay‑based adaptive scheduling without the “wait” trigger. This would help isolate the effect of dynamic modulation from the forcing heuristic and clarify the relative contribution of each component to EELo‑CoT’s gains.

  3. The choice of “wait” still feels somewhat arbitrary in the current paper. Since you note in the rebuttal that other textual patterns could serve as triggers, I encourage exploring a few alternatives (e.g., uncertainty markers, domain‑specific cues). This would make the approach feel less tied to a single token and strengthen the case for its generality.

  4. The different methods can benefit from clearer, more consistent descriptions. Currently, different names are used in different places (e.g., no “EELo‑CoT” label in Figure 7). It would be helpful to unify terminology across the text and figures, and to include all key baselines (constant, adaptive scheduling, forcing, EELo‑CoT) directly in Figure 7.

Overall, this paper definitely has merits: the activation‑based analysis is interesting, and the proposed methods are simple yet promising. However, I feel it would benefit from some rewriting. With a clearer experimental narrative, a bit more exploration of trigger patterns, and analysis of what high activation values are caused by (e.g., textual patterns, parameter norms, or other factors), this work could make a strong and more broadly applicable contribution.

Comment

Dear Reviewer

Thank you for your new insightful questions. We reply to your questions one by one below, and we will also include the new discussion and experiments in our revised version. Thank you again for helping us improve this work!

Question-1: Are the amplification factor experiments in Figures 2 & 3 essentially the same method as the "Constant" and “Forcing + Constant” baselines in Table 1, but evaluated in a smaller controlled setting? If so, I suggest integrating these results into the main results section alongside Table 1, and using Section 2 primarily to describe the neuron selection process. This would reduce redundancy and make the paper structurally easier to follow.

Reply to Question-1: The methods in Figures 2 & 3 and in Table 1 are not the same. Figures 2 & 3 are designed as controlled studies to isolate and analyze the impact of linearly amplifying the activation values on specific behaviors (e.g., accuracy and reflection). Thus, we vary only the amplification factor while holding all other variables the same as in the original LLM inference process, as proof-of-concept experiments. In contrast, Table 1 evaluates the performance of our designed approach, using a fitting-function-guided amplification schedule and a rule-based token insertion strategy; it is meant to show the superiority of our proposed method. We agree that the distinction between these two setups could be made clearer in the paper. In the revised version, we will rewrite the two parts to highlight their different focuses and reduce potential redundancy.

Question-2: Table 1 should include a baseline that applies your decay‑based adaptive scheduling without the “wait” trigger. This would help isolate the effect of dynamic modulation from the forcing heuristic and clarify the relative contribution of each component to EELo‑CoT’s gains.

Reply to Question-2:

We agree that isolating the effect of our decay-based adaptive scheduling from the “wait” token trigger is important for disentangling the contributions of each component in EELo-CoT. Following the suggestion, we conducted an additional ablation study on GPQA where we applied the decay-based adaptive activation modulation without inserting the “wait” token. Specifically:

The dynamic schedule was applied globally across the output sequence, and no special tokens (such as "wait") were inserted to trigger the intervention.

| Model Name | Accuracy | Length (words) | Self-reflection |
| --- | --- | --- | --- |
| Qwen 2.5 7B baseline | 30.30% | 457.34 | 4.04% |
| Qwen 2.5 7B + Ours without Wait Token | 33.33% | 592.08 | 73.74% |
| Qwen 2.5 7B + Ours | 35.86% | 585.78 | 68.18% |

As shown, applying the decay-based modulation alone leads to gains in accuracy, but cannot significantly improve the self-reflection rate and length over the baseline. In comparison, our full method further improves accuracy and self-reflection rate, indicating better elicitation of the long-CoT reasoning ability. This confirms that the activation control mechanism and the wait trigger token play important and complementary roles in our approach.

Comment

Question-3:

The choice of “wait” still feels somewhat arbitrary in the current paper. Since you note in the rebuttal that other textual patterns could serve as triggers, I encourage exploring a few alternatives (e.g., uncertainty markers, domain‑specific cues). This would make the approach feel less tied to a single token and strengthen the case for its generality.

Reply to Question-3:

Thank you for the valuable suggestion. Figure 4 is indeed designed to qualitatively illustrate the contrast in activation values between reflective triggers (e.g., “wait”) and neutral tokens (e.g., “the”, digits). While the qualitative pattern is compelling, we agree that adding quantitative metrics would strengthen the evidence.

To address this, we include in the revision a quantitative analysis that measures:

1. The average activation magnitude of long-CoT-related neurons when each token appears.
2. The proportion of selected neurons that exceed a threshold (e.g., activation > 0.4) for each token type.

| Token | Avg Activation | % > 0.4 |
| --- | --- | --- |
| Wait | 0.81 | 65.3% |
| The | -0.132 | 20.0% |
| Equals | 0.116 | 2.0% |
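
Both statistics are straightforward to compute; a sketch (where `acts` stacks the selected-neuron activations recorded at each occurrence of a given token; names are illustrative):

```python
import torch

def token_stats(acts: torch.Tensor, threshold: float = 0.4):
    """acts: (num_occurrences, num_selected_neurons) activations of the
    long-CoT neurons recorded whenever the token appears."""
    avg_activation = acts.mean().item()
    frac_above = (acts > threshold).float().mean().item()
    return avg_activation, frac_above
```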

As shown, the “wait” token reliably elicits higher average activation and activates a greater fraction of long-CoT neurons compared to neutral tokens like “the” or mathematical operators like “equals”. This provides quantitative confirmation that certain tokens (e.g., “wait”) serve as interpretable triggers for switching the model into a reflective reasoning mode.

We will include this table in the supplementary material and reference it in the revised discussion of Figure 4.

However, we fully agree that relying on a single token may limit generality. To strengthen this point, we have begun exploring alternative textual triggers, including:

  • Uncertainty markers: e.g., "Hmm", "I think", "Let me reconsider."
  • Mathematical operators or verbal reasoning cues: e.g., "step by step", "considering that."
  • Domain-specific prompts: e.g., "Recalling the definition…" in academic tasks.

This opens an exciting direction for future work: learning optimal intervention triggers, either through data-driven discovery or prompting strategies.

Question-4:

The different methods can benefit from clearer, more consistent descriptions. Currently, different names are used in different places (e.g., no “EELo‑CoT” label in Figure 7). It would be helpful to unify terminology across the text and figures, and to include all key baselines (constant, adaptive scheduling, forcing, EELo‑CoT) directly in Figure 7.

Reply to Question-4:

Thank you for this thoughtful and constructive suggestion. We fully agree that consistency in naming and presentation is important for clarity and readability. Following this suggestion, we will revise the manuscript to update Figure 7 with clear legends and color codes that align with the terminology used in Table 1 and Section 4, and add a summary caption or callout to reinforce which method corresponds to our proposed full approach. We appreciate your attention to detail and believe these improvements will make the experimental comparisons more accessible and rigorous.

Comment

Dear Reviewer

Following your suggestion, we have extended our analysis to other potential trigger patterns. Specifically, we replaced the "wait" token with the "however" token. The results are shown below:

| Model Used | Accuracy | Length (word counts) | Self-reflection |
| --- | --- | --- | --- |
| Qwen2.5 7B base model | 30.30% | 457.34 | 4.04% |
| Qwen2.5 7B base model (Ours, wait token inserted) | 35.86% | 585.78 | 68.18% |
| Qwen2.5 7B base model (Ours, however token inserted) | 35.35% | 939.95 | 41.41% |

Inserting trigger tokens significantly improves the reasoning behavior of the Qwen2.5 7B base model. Specifically, inserting the "wait" token yields the highest gains — increasing accuracy from 30.30% to 35.86% and self-reflection rate from 4.04% to 68.18%. The "however" token also improves performance over the base.

We fully agree with your points, and we are already undertaking substantial changes to the paper in response.

Here is an outline of the major revisions we are making:

  1. Integration of amplification factor experiments: we will restructure the content in Figures 2 & 3, merging the results into the main results section (Table 1), and revise Section 2 to focus solely on the neuron selection methodology. This should reduce redundancy and improve clarity.

  2. Adding a no-trigger baseline: we will add a new row in Table 1 to include the decay-based adaptive scheduling without the "wait" trigger, helping isolate the effect of dynamic modulation from the heuristic.

  3. Unifying naming and presentation across figures: we will revise figure captions and legends (particularly Figure 7) to ensure all methods are consistently named and labeled (e.g., "Constant", "Adaptive", "Forcing", "EELo-CoT"), with a legend added to improve readability.

Unfortunately, NeurIPS does not allow uploading a revised manuscript during the rebuttal phase, but we have already begun these revisions.

Thank you again for your valuable input — your comments were sharp, constructive, and very much aligned with our goals for improving the clarity and generality of this work.

Sincerely, Authors

Review
Rating: 5

The authors investigate how to efficiently elicit the long CoT reasoning ability of LLMs without expensive reinforcement learning or supervised fine-tuning on high-quality data. The authors discover that a small set of high-impact activations in the last few layers of LLMs is crucial for enabling long-form reasoning behaviors, such as extended output length and self-reflection. By amplifying these activations and inserting special "wait" tokens during inference, the model's long CoT abilities can be invoked without additional training, resulting in significant improvements in both self-reflection rates and answer accuracy.

This paper provides new insights into the internal mechanisms of LLM reasoning, and experimental results on challenging math and science reasoning benchmarks (Math500, AMC23, GPQA) show that the proposed methods significantly improve both accuracy and self-reflection rates, often matching or surpassing full fine-tuning and LoRA baselines while requiring much less computation and data.

Strengths and Weaknesses

Strengths:

  1. The paper introduces a training-free activation control technique that can elicit long CoT reasoning in LLMs by amplifying a small set of key activations in the last layers and inserting "wait" tokens during inference. This method does not require expensive reinforcement learning or large-scale supervised fine-tuning, making it highly efficient and broadly applicable. The connection between activation amplification and reasoning quality is largely empirical, without deep mechanistic understanding. The logarithmic decay function $f(t) = a - b \cdot \log(t + c)$ appears ad hoc and is fitted to limited data without theoretical grounding.

  2. It is interesting to see the extensive empirical analysis showing that long-CoT-related activations are concentrated in the last few layers of LLMs, and that amplifying these activations reliably improves both answer accuracy and the frequency of self-reflection across multiple benchmarks (Math500, AMC23, GPQA).

  3. The authors offer clear mechanistic insight into how long-form reasoning emerges in LLMs, identifying sparse, predictable activation patterns (notably a sharp rise and logarithmic decay after trigger tokens like "wait") that can be analytically modeled and controlled.

  4. All training and inference details, including data splits, hyperparameters, and intervention schedules, are specified in the paper and accompanying code/configs.

Weaknesses:

  1. One evident weakness is that while the paper identifies that amplifying a small set of activations in the last few layers elicits long CoT, it does not provide a deep mechanistic explanation of why these activations govern such reasoning. The approach is largely empirical, and the mapping between specific neurons and reasoning behaviors remains mostly correlative, not causal. It would be really interesting to see why this is happening.

  2. The intervention is heavily dependent on the presence and timing of special tokens like "wait." If the model does not generate or respond to these tokens as expected, the intervention may not activate, leading to inconsistent results.

  3. After looking at the supplementary materials, the code relies on manually selected hyperparameters (e.g., amplification factors, number of neurons, t_max, cooldown windows). These are tuned for specific models and datasets, and there is no robust procedure for adapting them to new settings. Small changes in these values led to instability or even degraded performance.

  4. The answer extraction and equivalence checking rely on regex and symbolic math parsing, which can fail or produce false positives/negatives for non-standard output formats. This can inflate or deflate reported accuracies.

  5. The method uses R1-distilled models to identify "long-CoT" activations, then applies these to base models, potentially introducing bias toward the specific distillation approach used.

  6. I am curious about, and was unable to find, which aspects of the intervention (activation amplification vs. wait token insertion) contribute most to the improvements. Hence, claims about "any LLMs and any datasets" are not sufficiently supported by the limited experimental validation, and the paper conflates correlation between activations and long-CoT behavior with causal relationships.

Questions

  1. The core evaluation metric for "self-reflection" in your code and experiments is the presence of certain keywords (e.g., "wait", "let me double check") in the output, and your main accuracy improvements are modest (often within a few percentage points). Given that your intervention logic (as implemented in intervene_functions.py and model.py) forcibly injects these tokens and amplifies activations based on brittle, rule-based triggers (such as digit counts or sentence boundaries), how can you rule out that your method is not simply gaming the metric—producing longer, more reflective-sounding outputs—without any substantive improvement in actual multi-step reasoning or problem-solving ability? Have you validated that the reasoning chains are more logically coherent or correct, beyond just being longer or containing more "reflection" phrases?

  2. Author's method assumes that amplifying a small, empirically selected set of "long-CoT-related" activations in the last layers is sufficient to elicit genuine long chain-of-thought reasoning. However, your selection process uses a fixed threshold on activation differences from just 160 contrastive pairs and does not account for the vast combinatorial space of possible reasoning patterns or tasks. How do authors justify that this sparse, static intervention—based on such a limited and potentially unrepresentative sample—captures the true causal mechanisms of reasoning in large language models, rather than merely inducing superficial verbosity or keyword repetition?

  3. Why did the authors not include comparisons to other interpretability-based intervention techniques? For example, would random neuron amplification or attention head interventions yield similar effects? I am curious whether you also looked at these papers and methods for analyzing attention [1], [2].

[1] Olsson, Catherine, et al. "In-context learning and induction heads." arXiv preprint arXiv:2209.11895 (2022).

[2] Wang, Zhenyu. "Logitlens4llms: Extending logit lens analysis to modern large language models." arXiv preprint arXiv:2503.11667 (2025).

  4. Have the authors tested their approach on tasks requiring different types of reasoning (e.g., commonsense, logical, or open-domain QA)? If not, how can you be sure the intervention generalizes?

Limitations

Yes.

Final Justification

The authors have answered all of my questions and provided detailed experiments on long CoT with the Qwen base model and other LLMs.

Formatting Concerns

No paper formatting concerns

Author Response

Dear Reviewer, We sincerely appreciate your comments and feedback, which help to improve our paper. We reply to all the concerns raised in the weaknesses and questions below.

Weakness-1: One evident weakness is that while the paper identifies that amplifying a small set of activations in the last few layers elicits long CoT, it does not provide a deep mechanistic explanation of why these activations govern such reasoning. The approach is largely empirical, and the mapping between specific neurons and reasoning behaviors remains mostly correlative, not causal. It would be really interesting to see why this is happening.

Reply to Weakness-1: We appreciate this insightful comment. Indeed, providing a full mechanistic and causal explanation for complex behaviors such as long chain-of-thought (CoT) reasoning remains an open challenge in interpretability research. Existing literature has discussed it theoretically from different perspectives [1,2,3]. Concretely, recent theoretical analyses have begun to formalize and critique CoT reasoning. For instance, Ton et al. (2024) treat each CoT step as an information-gain operation, revealing when apparent reasoning is vacuous; Barez et al. (2025) demonstrate that verbalized chains are often unfaithful and call for activation-level causal validation; Chi et al. (2025) diagnose the limits of current causal reasoning and provide new evaluation benchmarks. These works collectively motivate our focus on minimal, causally validated activation interventions rather than exhaustive symbolic explanations.

[1] Ton, Jean-Francois, Muhammad Faaiz Taufiq, and Yang Liu. "Understanding chain-of-thought in llms through information theory, 2024." URL https://arxiv. org/abs/2411.11984.

[2] Barez, Fazl, et al. "Chain-of-thought is not explainability." Preprint, alphaXiv (2025): v2.

[3] Chi, Haoang, et al. "Unveiling causal reasoning in large language models: Reality or mirage?." Advances in Neural Information Processing Systems 37 (2024): 96640-96670.

However, given the vast number of parameters and nonlinear dynamics in large language models (LLMs), it is hard to give a completely accurate mechanistic or causal explanation of this phenomenon. Thus, our primary goal in this paper is to identify a minimal and actionable set of activation patterns that are sufficient to elicit this behavior reliably, not to exhaustively explain why certain neurons govern reasoning. The strength of our approach lies in its efficiency and effectiveness: we demonstrate that amplifying a sparse set of MLP activations in the final transformer layers leads to meaningful improvements in accuracy, CoT length, and self-reflection rates. While our method is empirical, it offers useful causal evidence through targeted interventions, suggesting that these neurons are functionally significant, not merely correlated. Our findings provide a useful basis for future studies aiming to formally characterize the causal pathways underlying high-level reasoning in LLMs.

Weakness-2: The intervention is heavily dependent on the presence and timing of special tokens like "wait." If the model does not generate or respond to these tokens as expected, the intervention may not activate, leading to inconsistent results.

Reply to Weakness-2: We agree that the presence and timing of the "wait" token matter in our method. The "wait" token acts as a behavioral trigger, after which the high-activation regions of long chain-of-thought (CoT) neurons can better guide the LLM to generate accurate results. Without our wait-token insertion strategy, models do not always generate such tokens at the right moments, leading to unexpectedly low self-reflection rates. Thus, our approach explicitly inserts the "wait" token at strategic points in the generation process. This is not a reliance on random chance, but rather a controlled intervention designed to nudge the model toward reflective behavior when necessary. Below, we show a pair of comparison examples with and without the wait token. Without the wait token, the model confidently settles on an incorrect answer. With the wait token and activation modulation, the model pauses, reevaluates its reasoning steps (Step 7 → Step 8 → Step 9), and ultimately arrives at the correct answer. This illustrates that our wait-token insertion strategy provides the right timing and presence to enable recovery from errors by engaging deeper reasoning.

Qualitative comparison (Qwen 2.5 7B base):

Without wait token, with activation values: "To answer this question, let's analyze the results of the experiment step by step. ... Based on these results, we can conclude that: - G2 is a transcription factor because its absence results in complete susceptibility to anthracnose. - G1 and G3 show pleiotropy because both have some resistance to anthracnose, but their combined effect is less than expected. - G1 is epistatic towards G3 because the g1g3 double-mutant has less resistance than either g1 or g3 alone. Therefore, the most likely answer is \boxed{A}." (Incorrect Answer)

With wait token + activation values: "Step 7: Determine the epistatic relationship between G1 and G3. - G3 is epistatic towards G1 because the presence of G3 (50% resistance) reduces the effect of G1 (75% resistance) to 10% resistance. Wait, that's not the correct answer. Let's reanalyze the results. Step 8: Reanalyze the results of the ... The correct answer is \boxed{B}." (Correct Answer)

Weakness-3: After looking at the supplementary materials, the code relies on manually selected hyperparameters (e.g., amplification factors, number of neurons, t_max, cooldown windows). These are tuned for specific models and datasets, and there is no robust procedure for adapting them to new settings. Small changes in these values led to instability or even degraded performance.

Reply to Weakness-3: Our training-free method (Section 3) involves a set of manually selected hyperparameters, such as amplification factor, number of neurons, and cooldown window. Since this approach applies direct interventions without gradient-based optimization, some sensitivity to these hyperparameters is indeed expected—this is a known trade-off for training-free interpretability methods [4,5].

[4] Bansal, Naman; Agarwal, Chirag; Nguyen, Anh. “SAM: The Sensitivity of Attribution Methods to Hyperparameters.” arXiv preprint arXiv:2003.08754 (2020).

[5] Novello, Paul; Poëtte, Gaël; Lugato, David; Congedo, Pietro Marco. “Goal‑Oriented Sensitivity Analysis of Hyperparameters in Deep Learning.” arXiv preprint arXiv:2207.06216 (2022).

To address this limitation, we proposed a parameter-efficient trainable method in Section 4, which allows the model to automatically learn context-dependent amplification behavior. Specifically, we introduce a lightweight Activation Amplification Module that learns to scale neuron activations based on the input context; this module is applied only to a small subset of identified long-CoT neurons in the last layer. We combine it with LoRA layers in earlier layers to enable broader representation adaptation, while keeping most model parameters frozen. Despite updating only 1.51% of the total parameters, our trainable method achieves strong performance across benchmarks: on Math500, accuracy improves to 90.2%, close to full fine-tuning (91.6%), while reducing token length; on AMC23, we achieve 88.75% accuracy, outperforming LoRA (85.0%); and on GPQA, our method attains 70.02% accuracy, outperforming both LoRA (66.17%) and full fine-tuning (69.19%). This shows that our trainable method can learn the appropriate intervention dynamics in a robust and data-driven way, avoiding manual tuning while maintaining efficiency.

| Benchmark | Our Trainable Method | Full Fine-Tuning | LoRA |
| --- | --- | --- | --- |
| Math500 | 90.20% | 91.60% | 91.60% |
| AMC23 | 88.75% | 92.50% | 85.00% |
| GPQA | 70.02% | 69.19% | 66.17% |
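
A minimal sketch of such a learnable amplification module, under our reading of the description above (the gating form, names, and placement are illustrative assumptions, not the authors' exact code):

```python
import torch
import torch.nn as nn

class ActivationAmplifier(nn.Module):
    """Learns context-dependent scales for a fixed set of neuron positions."""
    def __init__(self, hidden_size: int, positions: torch.Tensor):
        super().__init__()
        self.register_buffer("positions", positions)  # selected neuron indices
        self.gate = nn.Linear(hidden_size, positions.numel())

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, hidden); amplify only the selected coordinates
        scale = 1.0 + torch.sigmoid(self.gate(h))      # scales in (1, 2)
        h = h.clone()
        h[..., self.positions] = h[..., self.positions] * scale
        return h
```

Only the small gating layer is trained, which is consistent with the low fraction of updated parameters reported above.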

Weakness-4: The answer extraction and equivalence checking rely on regex and symbolic math parsing, which can fail or produce false positives/negatives for non-standard output formats. This can inflate or deflate reported accuracies.

Reply to Weakness-4: Indeed, symbolic math parsing and regex-based methods can occasionally fail, especially for non-standard or free-form outputs. To minimize this issue, we adopted a state-of-the-art answer extraction and equivalence checking approach from recent works [6]. While no extraction method is perfect, this approach ensures high coverage and low false-positive/negative rates in practice. More importantly, we apply the same extraction pipeline across all baselines and ablation settings, ensuring experimental fairness and comparability when comparing the improvement of our methods with baselines.

[6] Yang, An, et al. "Qwen2. 5-math technical report: Toward mathematical expert model via self-improvement." arXiv preprint arXiv:2409.12122 (2024).
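
For reference, answer checking in this style typically combines regex extraction of the final \boxed{...} expression with symbolic comparison; a simplified sketch of this general pattern (not the exact pipeline of [6]):

```python
import re
from sympy import simplify, sympify

def extract_boxed(text: str):
    """Pull the last \\boxed{...} expression out of a generation."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

def equivalent(pred: str, gold: str) -> bool:
    """Symbolic equivalence with a string-match fallback."""
    try:
        return simplify(sympify(pred) - sympify(gold)) == 0
    except Exception:
        return pred.strip() == gold.strip()
```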

Weakness-5: The method uses R1-distilled models to identify "long-CoT" activations, then applies these to base models, potentially introducing bias toward the specific distillation approach used.

Reply to Weakness-5: We would like to clarify that our method does not rely on R1-distilled models to extract or apply long-CoT activations. In Figure 5, we showed that the activation patterns of the R1-distilled models are highly similar to those of the base model, particularly in sparsity and dynamics. Based on this, we conduct experiments directly on the base model to extract and apply long-CoT activations.

Comment

Question-1: The core evaluation metric for "self-reflection" in your code and experiments is the presence of certain keywords (e.g., "wait", "let me double check") in the output, and your main accuracy improvements are modest (often within a few percentage points). Given that your intervention logic (as implemented in intervene_functions.py and model.py) forcibly injects these tokens and amplifies activations based on brittle, rule-based triggers (such as digit counts or sentence boundaries), how can you rule out that your method is not simply gaming the metric—producing longer, more reflective-sounding outputs—without any substantive improvement in actual multi-step reasoning or problem-solving ability? Have you validated that the reasoning chains are more logically coherent or correct, beyond just being longer or containing more "reflection" phrases?

Reply to Question-1: By examining many examples, we found that the accuracy of intermediate steps indeed improves substantially, especially after the self-reflection operation. However, accurately evaluating intermediate steps within a chain of thought is still an open question, and we have not found an established metric for it. Here, we give some examples for demonstration. After self-reflection, our method successfully elicits the LLM to check and revise its intermediate reasoning results, leading to accurate final answers.

Case studies with intervention (Qwen 2.5 7B base):

Case study 1: "... So, the answer is (A) 10. But why did I get this wrong? Because I made a mistake when I was reading the question! I thought I was supposed to count the total number of carbon atoms in all three products, but I only needed to count the number of carbon atoms in product 3. Let me try again... After re-reading the question, I realize that I should only be counting the number of carbon atoms in product 3. And since the reactions didn't change the number of carbon atoms, the answer is indeed (A) 10."

Case study 2: "... Step 7: Determine the epistatic relationship between G1 and G3. - G3 is epistatic towards G1 because the presence of G3 (50% resistance) reduces the effect of G1 (75% resistance) to 10% resistance. Wait, that's not the correct answer. Let's reanalyze the results. Step 8: Reanalyze the results of the g1g3 mutant. - g1g3: 10% resistance - g1: 75% resistance - g3: 50% resistance Step 9: Determine the epistatic relationship between G1 and G3. - G1 is epistatic towards G3 because the presence of G1 (75% resistance) reduces the effect of G3 (50% resistance) to 10% resistance. The correct answer is \boxed{B}." (Correct Answer)

Case study 3: "... This is because the eigenvector corresponding to the eigenvalue $+\hbar/2$ for the Pauli matrices is $\begin{pmatrix} \cos(\theta/2) \\ \sin(\theta/2) \end{pmatrix}$. Wait, I made a mistake in the normalization. The correct normalized eigenvector is $\vec{v} = \begin{pmatrix} \cos(\theta/2) \\ \sin(\theta/2) \end{pmatrix}$. Therefore, the correct answer is \boxed{C}." (Correct Answer)

Question-2: Author's method assumes that amplifying a small, empirically selected set of "long-CoT-related" activations in the last layers is sufficient to elicit genuine long chain-of-thought reasoning. However, your selection process uses a fixed threshold on activation differences from just 160 contrastive pairs and does not account for the vast combinatorial space of possible reasoning patterns or tasks. How do authors justify that this sparse, static intervention—based on such a limited and potentially unrepresentative sample—captures the true causal mechanisms of reasoning in large language models, rather than merely inducing superficial verbosity or keyword repetition?

Reply to Question-2: In this work, we selected 160 contrastive pairs to reduce the data construction cost while efficiently testing the proof of concept behind our motivation. Our goal is not to claim a complete or exhaustive characterization of the causal mechanisms behind long chain-of-thought (CoT) reasoning, but rather to provide an efficient and practical method to elicit this ability using sparse, interpretable interventions. We therefore keep the entire method low-cost: a small dataset (160 pairs), the model's own generated results (from the base model), a small number of activation values (40), and a simple intervention (a non-parametric function). In this way, we show readers that this pathway is workable and hope to inform future work.

Comment

Weakness-6: I am curious and was unable to find which aspects of the intervention (activation amplification vs. wait token insertion) contribute most to improvements. hence, claims about "any LLMs and any datasets" are not sufficiently supported by the limited experimental validation and the paper conflates correlation between activations and long-CoT behavior with causal relationships

Reply to Weakness-6: In our experiments (Table 1), we conduct detailed ablations to study the contribution of activation amplification and wait token insertion:

| Model Used | Accuracy | Length (word counts) | Self-reflection |
| --- | --- | --- | --- |
| Qwen2.5 7B base model (wait token only) | 33.33% | 598.06 | 68.69% |
| Qwen2.5 7B base model (Activation only) | 33.33% | 466.95 | 5.56% |
| Qwen2.5 7B base model (wait token + Activation) | 31.31% | 598.22 | 71.72% |
| Qwen2.5 7B base model (EELo-CoT, ours) | 35.86% | 585.78 | 68.18% |

(1) Using only wait token insertion significantly increases the self-reflection rate, but has limited and inconsistent effects on accuracy. (2) Using only activation amplification improves accuracy, but does not reliably induce self-reflective behavior. (3) Combining both leads to the most robust and consistent gains in both accuracy and self-reflection across all benchmarks. These results suggest that each component contributes independently to a different aspect of long-CoT behavior, and their combination yields the strongest effect. Regarding causality, we demonstrate functional causality, i.e., manipulating specific neurons produces reliable changes in model behavior. This perspective is what lets us devise efficient and effective methods for eliciting long-CoT ability, including the training-free and parameter-efficient fine-tuning approaches. Regarding the claim about "any LLMs and any datasets", we agree that more validation across domains and architectures is necessary to support it; we will remove the overclaiming wording and conduct more experiments to validate generality.

Comment

Dear Reviewer,

Thank you again for your initial review and positive feedback. We have carefully addressed your concerns in our rebuttal and provided detailed explanations along with additional experimental results to support our claims.

We would greatly appreciate it if you could review our response and share your feedback. If there are any remaining concerns that require further clarification, we would be more than happy to provide more details and have a further discussion.

Best regards, The authors

Comment

Question-3: Why did authors not include comparisons to other interpretability-based intervention techniques? For example, would random neuron amplification or attention head interventions yield similar effects? I am curious if you also took a look at these papers and methods to analyze the attention [1], [2]

Reply to Question-3: Thank you for the thoughtful suggestion and for referencing relevant interpretability-based works. Following the suggestion, we compare our method against two intervention techniques, including:

(1) Full Activation Vector Transfer (causal-tracing style): we extracted MLP activation vectors from multiple layers of DeepSeek-R1-Distill-Qwen-7B for positive and negative pairs, computed layer-wise activation differences, and patched them into the base model Qwen2.5-7B during inference. (2) Random Neuron Amplification (with wait token): we randomly selected neurons to amplify, following otherwise the same settings as EELo-CoT.

The results on the GPQA dataset are shown in the table below. Neither variant performs better than our method. The reason is that EELo-CoT better captures the fine-grained main factors behind long-CoT ability, i.e., a few activation values. Such a sparse intervention reduces the influence of other confounders and also guarantees the strength of the control.

| Model Name | Accuracy | Length (words) | Self-reflection |
| --- | --- | --- | --- |
| Qwen 2.5 7B baseline | 30.30% | 457.34 | 4.04% |
| Qwen 2.5 7B + Full Activation Tracing | 33.33% | 592.08 | 73.74% |
| Qwen 2.5 7B + Random Neuron | 27.78% | 535.12 | 48.38% |
| Qwen 2.5 7B + Ours | 35.86% | 585.78 | 68.18% |

Question-4: Have authors tested their approach on tasks requiring different types of reasoning (e.g., commonsense, logical, or open-domain QA)? If not, how can you be sure the intervention generalizes?

Reply to Question-4: Thank you for raising this important point about generalizability across reasoning types. To evaluate the broader applicability of our method beyond arithmetic and factual tasks, we conducted additional experiments on the commonsense reasoning benchmark CommonSenseQA. We applied our EELo-CoT method to the Qwen2.5-7B base model and compared performance against the non-intervened baseline. The results show that our intervention leads to an improvement, though not as significant as on complex reasoning tasks. The reason is that commonsense reasoning is relatively easier and already has a high baseline accuracy, so long-CoT reasoning ability yields smaller gains.

Dataset: CommonSenseQA

| Method | Accuracy |
| --- | --- |
| baseline | 75.10% |
| EELo-CoT | 75.43% |

Comment

Dear Reviewer,

We sincerely thank you for your detailed and constructive comments. We have carefully addressed all the identified weaknesses and questions in our previous responses, including:

1. The causality vs. correlation concern in activation amplification;
2. Evaluation of trigger sensitivity and generalization across tasks;
3. Comparison to interpretability-based baselines (including random interventions);
4. Clarification on static intervention design and reasoning diversity;
5. Supplementing the "self-reflection" evaluation with more detailed metrics and examples.

Since the rebuttal phase ends very soon, we would greatly appreciate any final feedback or confirmation that our responses addressed your concerns.

Your input would be invaluable in helping us improve the clarity and rigor of the paper.

Best regards, The Authors

Comment

I thank the authors for taking the time and effort to answer all the questions. I have read not only my own but also the other reviewers' answers. Please do add the results of our discussion to the manuscript; I will adjust my score accordingly.

Comment

Dear Reviewer

We sincerely thank the reviewer for his or her thoughtful engagement throughout the rebuttal phase. We greatly appreciate your careful consideration of both our responses and the broader conversation. As suggested, we will incorporate the relevant results and clarifications from the discussion into the final version of the manuscript to ensure completeness and clarity.

Sincerely, Authors

Comment

Dear Reviewer

As the rebuttal period is coming to a close, we would like to sincerely thank you for your valuable feedback and thoughtful suggestions. We truly appreciate the time and effort you’ve taken to engage with our work—your input has been incredibly helpful in guiding us to improve the paper.

Thank you again for your support!

Sincerely, The Authors


Review
4

This paper investigates how the distribution of activation values affects the generation of long CoT reasoning in language models. The authors provide several empirical observations showing that activation values across layers are correlated with reasoning performance. Building on these findings, they introduce a training-free activation control method that aims to improve CoT reasoning without expensive additional training.

Strengths and Weaknesses

Strengths:

  1. The paper tackles an interesting and underexplored question: how internal activations influence a model’s reasoning ability. This perspective could open new directions for more efficient CoT elicitation.

  2. The proposed activation control method is training-free and potentially much more efficient than conventional approaches like RL or SFT.

Weaknesses:

  1. The paper lacks clear definitions and clarifications for several key concepts. For example, the terms “(key) activation values,” “activation value difference,” “positive and negative samples” (Line 86), and “self-reflection rate” (Line 120) are not rigorously defined. This makes it hard to fully understand and reproduce the method.

  2. Finding-3 could be better justified. The “significant improvement” in accuracy in Figure 2 is not clearly visible, and the increase in self-reflection rate might simply result from adding a “wait” token rather than the activation amplification itself. Have the authors tried adding the “wait” token alone in Figures 2 and 3? This would help isolate the effect of the proposed activation manipulation.

  3. The paper would benefit from more careful discussion about potential confounding factors. For example, how does adding tokens interact with the model’s natural decoding dynamics?

Questions

  1. In different figures, the scale of activation values varies widely (e.g., up to 1000 in Figure 1 but close to zero in Figures 4–6). What accounts for this discrepancy? Are the scales normalized or comparable?

  2. In Eq. (1), how are the parameters a, b, and c determined? Are they specific to each model or layer, or are they fixed? Could they be made trainable for better results?

  3. What exactly is the relative distance t in Eqs. (1) and (2)? How is it computed in practice?

Limitations

yes

Final Justification

The authors have addressed most of my concerns. Given that the authors have committed to incorporating these detailed discussion in the revision, I decided to increase my score from 3 to 4.

Formatting Issues

N/A

Author Response

Dear Reviewer,

We sincerely appreciate your comments and feedback, which help to improve our paper. We reply to all the concerns raised in the Weaknesses and Questions sections.

Weakness-1: Limited definitions and clarifications: lacks clear definitions and clarifications for several key concepts. For example, the terms “(key) activation values,” “activation value difference,” “positive and negative samples” (Line 86), and “self-reflection rate” (Line 120) are not rigorously defined.

Reply to Weakness-1: Due to the page limitation, we were unable to define all terms rigorously in the initial submission. In the revision, we plan to include a concise term-definition table to clarify all technical concepts. Below we provide definitions for the specific terms mentioned:

Key activation values: Activation values refer to the hidden states at the MLP activation (act_fn) layers of the model. The key activation values are those with larger magnitudes in a given layer, which generally indicate that important knowledge or abilities are invoked during LLM inference [1,2].

[1] Shafran, Or; Geiger, Atticus; Geva, Mor. “Decomposing MLP Activations into Interpretable Features via Semi‑Nonnegative Matrix Factorization.” arXiv preprint arXiv:2506.10920 (2025).

[2] Garde, Albert; Kran, Esben; Barez, Fazl. “DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models.” arXiv preprint arXiv:2310.01870 (2023).

Activation value difference: This is the difference in average activation values between positive and negative samples. For a given MLP layer, we compute the mean activation over all positive outputs and over all negative outputs separately, and then compute the difference as: Δ = mean_positive − mean_negative. This allows us to identify neurons that are significantly more active in high-quality (e.g., accurate or reflective) generations.

Positive and negative samples: As described in Section 2.1, we select contrastive examples from model outputs to help identify key activation values. Specifically: Positive samples are CoT responses that are accurate, long (length > 1000 tokens), and contain self-reflection cues such as "wait", "let me double check", etc. Negative samples are responses that are inaccurate, shorter, and lack reflective phrases.

Self-reflection rate: This metric measures the percentage of outputs containing self-reflection. If an output contains at least one predefined self-reflective phrase, such as "wait", "let me double check", or "let's verify", we classify it as positive, and otherwise as negative. This is a binary classification per sample, averaged over each dataset. We will include a term-description table in the revised version to improve the understanding of all the terms.
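As a concrete illustration of these definitions, here is a minimal sketch in Python; the thresholds follow the definitions above, while the field names (`correct`, `num_tokens`, `text`) and the activation matrices are illustrative assumptions.

```python
import numpy as np

REFLECTION_CUES = ["wait", "let me double check", "let's verify"]

def has_self_reflection(text: str) -> bool:
    # An output counts as self-reflective if it contains any predefined cue.
    t = text.lower()
    return any(cue in t for cue in REFLECTION_CUES)

def self_reflection_rate(outputs) -> float:
    # Fraction of responses containing at least one reflective phrase.
    return sum(has_self_reflection(o) for o in outputs) / len(outputs)

def is_positive(sample) -> bool:
    # Positive sample: accurate, long (> 1000 tokens), and reflective.
    return (sample["correct"]
            and sample["num_tokens"] > 1000
            and has_self_reflection(sample["text"]))

def delta_activation(pos_acts, neg_acts):
    # pos_acts / neg_acts: [num_samples, num_neurons] mean MLP activations.
    # Large positive entries mark candidate long-CoT neurons.
    return np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)
```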

Weakness-2: Finding-3 could be better justified. The “significant improvement” in accuracy in Figure 2 is not clearly visible, and the increase in self-reflection rate might simply result from adding a “wait” token rather than the activation amplification itself. Have the authors tried adding the “wait” token alone in Figures 2 and 3? This would help isolate the effect of the proposed activation manipulation.

Reply to Weakness-2: Our intention in Figures 2 and 3 was to evaluate whether combining activation amplification with the insertion of the "wait" token yields greater improvements than using activation amplification alone. Following the reviewer's suggestion, we also tested the wait-token-only variant; the results are shown in the table below. They show that while inserting the wait token alone does increase the self-reflection rate, its effect on accuracy is limited. In contrast, the combination of activation amplification and wait-token insertion leads to a more substantial and consistent improvement in both accuracy and self-reflection, supporting our claim that the two components are complementary.

To better illustrate the comparison of the two methods, we randomly selected one example, shown below. Although the wait-token-only variant forces the LLM to perform "self-reflection", the generated content merely repeats the question: an imitation of reflection without actual thinking. In comparison, with the activation control strategy, the LLM starts to think deeply and verifies the correctness of the answer in the self-reflection parts.

Inference on GPQA benchmark

| Model Name | Add Wait Token Only | Add Token and Activation Values |
|---|---|---|
| Qwen 2.5 7B | 33.33% Acc | 35.86% Acc |
| Qwen 2.5 Math 7B | 32.83% Acc | 37.88% Acc |

**Wait Token Only (Qwen 2.5 7B base):** "Wait, we're not given the products 1 and 2. Let's go through the steps again... how many carbon atoms are there in product 3? (A) 10 (B) 12 (C) 14 (D) 11. We're still not given the products. Let's go through the steps again... how many carbon atoms are there in product 3? (A) 10 (B) 12 (C) 14 (D) 11. ... (repeating indefinitely)"

**Wait Token + Activation Values (Qwen 2.5 7B base):** "So, the answer is (A) 10. But why did I get this wrong? Because I made a mistake when I was reading the question! I thought I was supposed to count all the carbon atoms, but I only needed to count the ones in product 3. Let me try again... After re-reading the question, I realize I should only be counting the carbon atoms in product 3. Since the reactions didn't change them, the answer is indeed (A) 10. Sorry about that, and I hope this explanation helps!"

Weakness-3: The paper would benefit from more careful discussion about potential confounding factors. For example, how does adding tokens interact with the model’s natural decoding dynamics?

Reply to Weakness-3: We appreciate the reviewer's concern about potential confounding factors, particularly how adding tokens like "wait" may affect the model's natural decoding dynamics. In our experiments, we found that adding the "wait" token influences generation by encouraging the model to imitate self-reflective behavior. However, while the token's presence can trigger self-reflection, it does not ensure the quality or correctness of that reflection; the precision and effectiveness of self-reflection are largely governed by the activation values of the specific neurons we identify and modulate (see the case study above). This distinction is supported by our ablation results: inserting the "wait" token alone increases the self-reflection rate but has limited effect on accuracy, whereas combining it with activation amplification yields consistent improvements in both reflection and correctness. This indicates that the interaction between token-level cues and neuron-level control is synergistic (see the experimental results above).

In addition, we considered the following potential confounding variables in our method and conducted two additional ablation studies:

1. Random Neuron Amplification (with wait token): We randomly selected activations to amplify using the same decay schedule as EELo-CoT. The results (on AMC23) are shown below:

| Method | Accuracy | Length (words) | Self-reflection |
|---|---|---|---|
| Qwen2.5-7b baseline | 45.00% | 436.15 | 7.50% |
| Qwen2.5-7b + Random Neuron Amplification | 37.50% | 570.88 | 65.00% |

These results confirm that simply amplifying arbitrary neurons—despite increasing reflection-like behavior—fails to improve accuracy, emphasizing the importance of targeted neuron selection.

2. Token-Level Activation Comparison: We also analyzed how different tokens affect the activations of long-CoT-related neurons. The table below reports the average activation magnitude and the percentage of neurons exceeding a threshold (e.g., > 4.0) for each token:

| Token | Avg Activation | % > 4.0 |
|---|---|---|
| Wait | 0.81 | 65.3% |
| The | -0.132 | 20.0% |
| Equals | 0.116 | 2.0% |

As shown, the “wait” token activates the target neurons more strongly and more frequently than neutral tokens. This provides quantitative evidence that certain tokens (e.g., “wait”) function as interpretable triggers for switching the model into a more reflective mode.

We acknowledge that additional factors may influence decoding dynamics (e.g., context length, token distribution), and it is difficult to account for every potential confounder in this work. Given our goal of proposing an efficient and effective approach, our main focus is to identify and validate the two dominant contributors to eliciting long chain-of-thought ability: (1) token-based triggering and (2) activation-based control. Based on these, we devise our efficient training-free and parameter-efficient fine-tuning methods (see Sections 3 and 4).

Comment

Question-1: In different figures, the scale of activation values varies widely (e.g., up to 1000 in Figure 1 but close to zero in Figures 4–6). What accounts for this discrepancy? Are the scales normalized or comparable?

Reply to Question-1: We sincerely thank the reviewer for the careful reading, and we will revise this part to make the description clearer. The y-axes in Figure 1 and Figures 4–6 represent the count and the magnitude of activation values, respectively. Concretely, Figure 1 reports the number of activation values greater than 4 in each layer, i.e., a count of significantly activated values. In contrast, Figures 4–6 plot the actual activation values of selected neurons over the sequence or relative to specific tokens (e.g., "wait"). Figure 1 thus illustrates the distribution of highly activated neurons across layers, while Figures 4–6 visualize the temporal dynamics and sparsity of key activation trajectories during generation. From these figures we observe that (1) long-CoT activations mainly exist in the last few layers; (2) base and long-CoT models exhibit similar sparse activation dynamics; and (3) activations around the wait token follow a predictable pattern. These findings motivate the design of our efficient training-free and fine-tuning methods.
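For clarity, the Figure-1-style statistic can be computed with a sketch like the following; the hook-based counting and layer path are illustrative, assuming a Hugging Face Qwen2-style model.

```python
from collections import defaultdict
import torch

def count_high_activations(model, input_ids, threshold=4.0):
    # Count, per layer, how many MLP activation values exceed the threshold
    # during a single forward pass (the quantity plotted in Figure 1).
    counts = defaultdict(int)
    handles = []
    for idx, layer in enumerate(model.model.layers):
        def hook(module, inputs, output, i=idx):
            counts[i] += int((output > threshold).sum().item())
        handles.append(layer.mlp.act_fn.register_forward_hook(hook))
    with torch.no_grad():
        model(input_ids)
    for h in handles:
        h.remove()
    return dict(counts)
```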

Question-2: In Eq. (1), how are the parameters a,b and c determined? Are they specific to each model or layer, or are they fixed? Could they be made trainable for better results?

Reply to Question-2: To determine the parameters a, b, and c in Equation (1), we first collect activation trajectories from the identified long-CoT-related neurons, focusing in particular on their behavior around the "wait" token. We then fit the analytic function f(t) = a − b·log(t + c) to the empirical activation curves using standard curve-fitting techniques. These parameters are derived from contrastive example pairs (Section 3) and are fixed during inference for the training-free method. In that method, we adopt a simple scheme that applies the same fitted function globally, across layers and across models. This allows for a simple, easy-to-follow implementation while still yielding strong performance, as shown in the following table.

Inference on GPQA benchmark

| Method | Accuracy | Length (words) | Self-reflection |
|---|---|---|---|
| Qwen2.5-7b baseline | 30.30% | 457.34 | 4.04% |
| Qwen2.5-7b EELo-CoT | 35.86% | 585.78 | 68.18% |

The observation that many long-CoT-related activations follow similar decay trends enables this shared approximation. We strongly agree with the reviewer that the optimal parameters may vary across different neurons or models. Therefore, in Section 4, we extend this idea into a trainable setting by designing an activation amplification module. Instead of fitting a single fixed function, this module learns context-dependent scaling factors for a selected subset of neurons, allowing the amplification behavior to be learned directly from data. This learnable approach results in even better performance, as shown in the following table.

GPQA

| Method | Accuracy | Length (words) | Self-reflection |
|---|---|---|---|
| Qwen2.5-32b baseline | 51.01% | 598.23 | 28.28% |
| Qwen2.5-32b trained (Ours) | 70.02% | 8593.46 | 100% |

To summarize:

1. Section 3 (training-free): a fixed analytic function fitted once; efficient, with zero training cost.

2. Section 4 (trainable): a learnable amplification module; parameter-efficient training with better performance.

This design provides a trade-off between efficiency and performance, and users may choose the appropriate version depending on their deployment needs. We will make this distinction clearer in the revised version.
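To make the fitting step concrete, a minimal sketch using scipy's standard curve fitting is shown below; the trajectory data here is a synthetic stand-in for the empirical activation curves.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(t, a, b, c):
    # The analytic form of Eq. (1): f(t) = a - b * log(t + c).
    return a - b * np.log(t + c)

# t: token offsets after the "wait" token;
# y: mean activation of a key neuron at each offset (synthetic stand-in).
t = np.arange(1, 51, dtype=float)
y = decay(t, 2.0, 0.5, 1.0) + 0.05 * np.random.randn(t.size)

(a, b, c), _ = curve_fit(decay, t, y, p0=(1.0, 1.0, 1.0))
print(f"fitted: a={a:.3f}, b={b:.3f}, c={c:.3f}")
```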

Question-3: What exactly is the relative distance in Eqs. (1) and (2)? How is it computed in practice?

Reply to Question-3: In Equations (1) and (2), the variable t represents the relative distance (in tokens) from a designated trigger token—typically the "wait" token—to another token in the same sentence. Concretely, during generation, whenever a "wait" token is inserted, we define it as the reference point (i.e., t=0). For each subsequent token in the same sentence, t is incremented by 1 per token until the sentence ends (e.g., at a period, newline, or EOS). The amplification function f(t) is then applied to adjust the activations of selected neurons based on their position relative to this "wait" token. This design reflects our empirical observation that activation values tend to spike immediately after the "wait" token and then decay over the following tokens. The function f(t)=a−b⋅log(t+c) captures this behavior, allowing us to dynamically modulate neuron activations during generation.
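Putting the two replies together, the inference-time schedule can be sketched as follows; the class name and the sentence-boundary set are illustrative assumptions, not the exact implementation.

```python
import math

class WaitDecaySchedule:
    """Tracks the relative distance t from the last "wait" token and
    returns the amplification factor f(t) = a - b * log(t + c)."""

    SENTENCE_END = {".", "?", "!", "\n"}

    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c
        self.t = None  # None means no active "wait" window

    def step(self, token: str) -> float:
        if token.lower() == "wait":
            self.t = 0          # reference point: t = 0 at the trigger
        elif token in self.SENTENCE_END:
            self.t = None       # sentence ended: stop modulating
        if self.t is None:
            return 1.0          # leave activations unchanged
        self.t += 1             # t counts tokens after the trigger
        return self.a - self.b * math.log(self.t + self.c)
```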

Comment

Dear Reviewer,

Thank you again for your initial review. We have carefully addressed your concerns in our rebuttal and provided detailed explanations along with additional experimental results to support our claims.

We would greatly appreciate it if you could review our response and share your feedback. If there are any remaining concerns that require further clarification, we would be more than happy to provide more details and have a further discussion.

Best regards, The authors

Comment

I have read the authors' response and appreciate their efforts on the experimental evaluations. I believe the paper would benefit from incorporating those detailed discussions. I have no further questions and would like to increase my score.

Comment

Dear Reviewer

Thank you again for your time and thoughtful reviews. We appreciate your engagement and feedback throughout the process, and we're glad to know our additional experiments helped address your concerns.

Sincerely, Authors

Comment

Dear Reviewer

We sincerely thank the reviewer for his or her thoughtful engagement throughout the rebuttal phase. We greatly appreciate your careful consideration of both our responses and the broader conversation. As suggested, we will incorporate the relevant results and clarifications from the discussion into the final version of the manuscript to ensure completeness and clarity.

Sincerely, Authors

Comment

Dear Reviewer

As the rebuttal period is coming to a close, we would like to sincerely thank you for your valuable feedback and thoughtful suggestions. We truly appreciate the time and effort you’ve taken to engage with our work—your input has been incredibly helpful in guiding us to improve the paper.

Thank you again for your support!

Sincerely, The Authors

Review
4

This paper presents a mechanistic interpretability analysis of long chain-of-thought (CoT) capabilities. The core finding is that long-CoT capabilities are strongly connected with a subset of activations in the last few layers of transformers. By amplifying this set of activations and inserting special tokens ("wait"), it is possible to elicit long chain-of-thought behavior, leading to certain performance improvements on math and STEM questions. Based on this finding, the paper also proposes an adaptation of LoRA that fine-tunes only the last few layers, matching the performance of full fine-tuning or full-layer LoRA fine-tuning.

Strengths and Weaknesses

Strength

The paper makes a step towards mechanistic understanding of long chain-of-thought capabilities, providing both interpretation and intervention experiments that show some causal relation between these activations and model final behavior (measured by downstream performance).

Some findings are interesting, like “Instruct Model Activations are Very Inactive” compared to base and long cot models.

The elicitation method, especially the design of dynamic intervention factors is intuitive and shows some gains compared to constant intervention baseline.

The paper is mostly well-written and easy to follow.

Weakness

The observation that most activations differ in the last few layers may be closely related to the fact that these layers contribute heavily to the logits of final tokens. Since the positive and negative pairs are constructed from SFT and instruct models, the activation differences in the final layers could be attributed to the different token patterns between SFT-ed and instruct models. An alternative approach would be to construct positive-negative pairs using RL-trained models and the base models; I wonder whether this might lead to more insights.

While the intervention methods offer some interpretability insights, their effectiveness may not match that of actual fine-tuning approaches. The parameter-efficient fine-tuning method trades some efficacy for some efficiency gains. The evaluation is conducted on LIMO (~800 examples); it is unclear whether using larger datasets would reveal a wider performance gap between full fine-tuning and the proposed method.

The experiments cover only one model family (Qwen); it is unclear how the results would generalize to other model families such as Llama.

(minor point / suggestion):

For Table 2, can you provide the distilled models' performance as a reference as well? It is not directly comparable, but it would set a reference for the strength of the intervention methods.

While Figure 4 shows clear qualitative trends for a few examples, it would be beneficial to also report quantitative results measuring this behavior.

Questions

See weakness

Limitations

yes

Final Justification

I thank the authors for addressing some of the clarity issues and adding more experiments. The work, especially the tuning part, is mainly about distillation from large to small models; most of the emergent reasoning capabilities, however, come from RL training, which the work does not heavily touch. Therefore I am increasing the score only to 4 (borderline accept).

Formatting Issues

N/A

Author Response

Dear Reviewer

We sincerely appreciate your comments and feedback, which help to improve our paper. We reply to all the concerns raised in the Weaknesses and Questions sections.

Weakness-1: The observation that most activations differ in the last few layers may be closely related to the fact that these layers contribute heavily to the logits of final tokens. Since the positive and negative pairs are constructed from SFT and instruct models, the activation differences in the final layers could be attributed to the different token patterns between SFT-ed and instruct models. An alternative approach would be to construct positive-negative pairs using RL-trained models and the base models; I wonder whether this might lead to more insights.

Reply to Weakness-1: We sincerely appreciate the reviewer's insightful observation. To explore this concern, we followed the suggestion and re-constructed the positive-negative pairs using outputs generated by an RL-trained model and a base model, respectively. Concretely, we used the same sampled data and the same selection rule for positive and negative samples, but drew them from R1-distilled Qwen-7B and Qwen-7B, respectively. We then identified key long-CoT neurons via activation difference analysis (Δ-activation) in the Qwen 2.5 7B base model, and ran our training-free method on the reasoning benchmarks. The results are as follows. Using positive and negative pairs from R1-generated examples also brings improvements, mirroring the observation in the paper. This indicates that the activation control captures the actual long-CoT reasoning paradigm rather than merely following token patterns.

| Model Used | Accuracy | Length (word count) | Self-reflection |
|---|---|---|---|
| Qwen2.5 7B base model (GPQA baseline) | 30.30% | 457.34 | 4.04% |
| Qwen2.5 7B base model (GPQA intervention) | 32.32% | 971.93 | 40.40% |
| Qwen2.5 7B base model (AMC23 baseline) | 45.00% | 436.15 | 7.50% |
| Qwen2.5 7B base model (AMC23 intervention) | 50.00% | 498.32 | 67.50% |

Weakness-2: While the intervention methods offer some interpretability insights, their effectiveness may not match that of actual fine-tuning approaches. The parameter-efficient fine-tuning method trades some efficacy for some efficiency gains. The evaluation is conducted on LIMO (~800 examples); it is unclear whether using larger datasets would reveal a wider performance gap between full fine-tuning and the proposed method.

Reply to Weakness-2: We first make the following clarifications:

  1. We developed the intervention method to provide an easy-to-use approach with low computational cost, as a proof of concept of our motivation.

  2. Following this motivation, we then devised a trainable method to achieve better performance; since it can train its parameters on more data, it can be expected to perform better. This design provides a trade-off between efficiency and performance, and users may choose the appropriate version depending on their deployment needs.

To address the reviewer's concern about generalization to larger datasets, we extended our evaluation beyond LIMO by training our method on the Bespoke-Stratos-17k dataset (https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k) and evaluating on the MATH500 and GPQA datasets. The results show that our trainable method achieves comparable or even better performance than LoRA-based fine-tuning, indicating the effectiveness of our approach.

| Method | Math500 | GPQA |
|---|---|---|
| Without FT | 81.40% (556.13) | 51.01% (598.23) |
| LoRA | 90.20% (1753.61) | 57.58% (1689.81) |
| Trainable Method (Ours) | 90.60% (1928.31) | 58.59% (1643.95) |

Weakness-3: The experiments cover only one model family (Qwen); it is unclear how the results would generalize to other model families such as Llama.

Reply to Weakness-3: Thank you for the suggestion. We strongly agree that testing generalizability on other models is important. To address this, we included an additional evaluation using the LLaMA-3.1 model in the Appendix of the original submission. Specifically, we applied our intervention method and evaluated it on Math500, a challenging benchmark requiring multi-step mathematical reasoning. The results show that our method also leads to significant improvements in accuracy and self-reflection rate, indicating its effectiveness in eliciting the long chain-of-thought reasoning ability of LLMs.

| Model Name | Accuracy | Length (word count) | Self-reflection Rate |
|---|---|---|---|
| Llama-3.1-8B-base | 9.8% | 713.5 | 0.8% |
| Llama-3.1-8B-base + Ours | 11.8% | 783.09 | 38.40% |

Weakness-4: For Table 2, can you provide the distilled models' performance as a reference as well? It is not directly comparable, but it would set a reference for the strength of the intervention methods.

Reply to Weakness-4: Thank you for the suggestion. We follow it by adding results for R1-distilled Qwen 7B, and also applying our intervention method to it. We report results on the GPQA dataset, where we again observe a significant improvement on the distilled model, further indicating the effectiveness of our approach. Beyond accuracy, our intervention also increases reasoning length and self-reflection rate, demonstrating its effect on controlling long-CoT ability.

| Model Name | Accuracy | Length (word count) | Self-reflection Rate |
|---|---|---|---|
| R1-distilled Qwen 7B | 35.35% | 2306.43 | 99.49% |
| R1-distilled Qwen 7B + Ours | 38.89% | 2349.71 | 100.00% |

Weakness-5: While Figure 4 shows clear qualitative trends for a few examples, it would be beneficial to also report quantitative results measuring this behavior.

Reply to Weakness-5: Thank you for the valuable suggestion. Figure 4 qualitatively illustrates the contrast in activation values between reflective triggers (e.g., "wait") and neutral tokens (e.g., "the", digits). To quantify this behavior, we add two metrics: (1) the average activation magnitude of long-CoT-related neurons when each token appears; and (2) the proportion of selected neurons whose activation exceeds a threshold (e.g., > 4) for each token type. As shown in the following table, the "wait" token reliably elicits higher average activation and activates a greater fraction of long-CoT neurons than neutral tokens like "the" or mathematical operators like "equals". This provides quantitative confirmation that certain tokens (e.g., "wait") serve as interpretable triggers for switching the model into a reflective reasoning mode.

| Token | Avg Activation | % > 4.0 |
|---|---|---|
| Wait | 0.81 | 65.3% |
| The | -0.132 | 20.0% |
| Equals | 0.116 | 2.0% |
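The two statistics in this table can be computed with a sketch like the following; `acts` is an assumed pre-collected mapping from each token to the activations of the selected neurons at that token's positions.

```python
import numpy as np

def token_stats(acts, threshold=4.0):
    # acts: dict token -> array [occurrences, num_selected_neurons]
    stats = {}
    for token, a in acts.items():
        stats[token] = {
            "avg_activation": float(a.mean()),
            "pct_above_threshold": float((a > threshold).mean() * 100.0),
        }
    return stats
```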
Comment

Thank you for the response and added explanations. I will increase my score to the positive side.

Comment

Dear Reviewer

Thank you again for your time and thoughtful reviews. We appreciate your engagement and feedback throughout the process, and we're glad to know our additional experiments helped address your concerns.

Sincerely, Authors

Comment

Dear Reviewer

Thank you again for your time, thoughtful feedback, and engagement throughout the review process. We truly appreciate your acknowledgement that our additional experiments and clarifications addressed your concerns.

We noticed that the review score has not yet been updated, and we just wanted to send a gentle reminder in case it was an oversight.

Thank you again for your support and consideration.

Sincerely, The Authors

Comment

Dear Reviewer

As the rebuttal period is coming to a close, we would like to sincerely thank you for your valuable feedback and thoughtful suggestions. We truly appreciate the time and effort you’ve taken to engage with our work—your input has been incredibly helpful in guiding us to improve the paper.

Thank you again for your support!

Sincerely, The Authors

Comment

Dear Reviewer,

Thank you again for your initial review. We have carefully addressed your concerns in our rebuttal and provided detailed explanations along with additional experimental results to support our claims.

We would greatly appreciate it if you could review our response and share your feedback. If there are any remaining concerns that require further clarification, we would be more than happy to provide more details and have a further discussion.

Best regards, The authors

Comment

Dear Reviewers,

First and foremost, we would like to express our sincere gratitude to each of you for your thoughtful, constructive, and prompt feedback during this rebuttal period. Despite the short timeframe, all reviewers have actively participated in the discussion, and we are especially heartened by the positive and encouraging remarks many of you have shared. Your engagement has not only helped clarify key aspects of our work but also provided valuable insights to strengthen the paper further. We deeply appreciate your time and effort in contributing to a rigorous and productive review process.

Over the past few days, we have received many helpful and actionable suggestions. In response, we have conducted additional experiments and analyses to better support our claims and address the points raised. Below is a summary of the new experiments and improvements:

  1. Comparisons to interpretability-based intervention baselines: We implemented Full Activation Vector Transfer (a causal-tracing-style method) using activation deltas from good vs. bad generations and evaluated the performance of patching these into a base model. We also performed random neuron amplification, showing that indiscriminate activation does not yield improvements comparable to our targeted neuron intervention.

  2. Token-level confounding ablation: We tested the reflection-triggering effect of various tokens by measuring their impact on long-CoT-related neuron activations. The "Wait" token showed consistently stronger activation than neutral tokens like "the" or symbolic ones like "equals".

  3. Alternative token triggers and decoding behavior: We explored the impact of inserting "wait" without adaptive scheduling to isolate the individual contribution of token triggering, and showed it is not sufficient without corresponding activation modulation.

  4. Decoding dynamics and quality vs. quantity of reflection: We showed that wait-token insertion alone increases reflection frequency but not accuracy; only when combined with activation amplification does it lead to better reasoning.

  5. Ablations across scheduling mechanisms: We included new results that contrast constant amplification, decay-based scheduling, and forcing + decay to clarify the role of dynamic modulation. These results help disentangle the effects of each design choice.

  6. Performance and efficiency trade-offs: We showed that our training method achieves accuracy competitive with full fine-tuning (e.g., 90.2% vs. 91.6% on Math500). It also outperforms LoRA on AMC23 and GPQA.

Planned Revisions to the Manuscript

We are committed to incorporating these experiments and feedback in the future version. Specifically, we will:

  1. Revise Section 2 (Neuron Selection Process) to focus more on the methodology and move amplification factor ablation results (currently in Figures 2 & 3) into the main experimental section.

  2. Update Table 1 to include the ablation baseline with adaptive scheduling only (no wait token), as requested.

  3. Clarify terminology throughout the paper (e.g., consistently refer to “EELo-CoT” across all figures, including Figure 7).

  4. Expand discussion in Section 5 to address confounding factors in decoding dynamics, and emphasize the synergistic effect between token-level triggers and activation-level interventions.

  5. Add a short subsection in the appendix evaluating alternative reflection tokens (e.g., “rethink”, “hold on”, “let’s try again”) and discussing future directions for more generalizable triggers.

  6. Include Table 3 in the main text and clarify its role in demonstrating efficiency-performance trade-offs.

  7. Polish writing across figures and legends for consistency and clarity (e.g., relabeling axes, unifying method names, fixing minor typos).

In summary, we are grateful for the opportunity to engage in such a thoughtful review process. We believe that the improvements we’ve made—based on your feedback—substantially strengthen both the empirical and conceptual contributions of our work. We sincerely appreciate all the positive feedback and constructive suggestions that have guided these revisions.

Thank you very much

— The Authors

Final Decision

Scientific claims and findings: This paper introduces a novel and efficient method for eliciting long chain-of-thought (CoT) reasoning in large language models (LLMs). The authors make the scientific claim that a small, identifiable set of activations in the final layers of an LLM governs its long-form reasoning capabilities, including self-reflection. Based on this insight, they propose two methods: a training-free inference-time intervention that amplifies these key activations and inserts "wait" tokens to trigger reflection, and a parameter-efficient fine-tuning (PEFT) approach that learns to control these activations.

Strengths :

  • Novelty and Insight: The paper offers a new, mechanistic perspective on CoT reasoning by investigating the role of internal model activations. This is a departure from common approaches that rely solely on data-driven fine-tuning.
  • Efficiency: The proposed training-free method is highly efficient, allowing for the enhancement of reasoning abilities without the need for costly retraining or fine-tuning. The PEFT variant also demonstrates superior or competitive performance with significantly fewer trainable parameters than standard LoRA.
  • Practicality and Strong Empirical Results: The methods are simple to implement and show consistent improvements in accuracy and self-reflection rates across several challenging reasoning benchmarks.

Weaknesses:

  • Mechanistic Depth: While the paper successfully identifies a correlation and demonstrates a causal link through intervention, the submission lacked a deep mechanistic explanation as to why these specific activations control reasoning.
  • Clarity and Exposition: Some key concepts, experimental details, and the distinction between different proposed methods were initially unclear.

**Reason for Accept (Spotlight)**: This paper introduces a conceptually novel and interpretable approach to a critical area of LLM research—improving complex reasoning. The core idea of controlling high-level cognitive behavior like CoT by manipulating a sparse set of internal activations is both novel and impactful. It opens up new avenues for efficient, targeted model control that are less opaque than standard fine-tuning. The simplicity of the intervention, combined with its effectiveness, is likely to inspire significant discussion and follow-up work in the community. The reviewers are all satisfied with the authors' responses and additional experiments and all give positive ratings. The authors should incorporate the edits promised in the rebuttal into the revision.