Learning Distribution-wise Control in Representation Space for Language Models
We propose learning distribution-wise control in the latent space of language models, which is effective against existing PEFT and other intervention baselines.
Abstract
Reviews and Discussion
Post-rebuttal edit: the authors have provided detailed responses to my concerns during the discussion phase, and seem to have taken on board my concerns about the clarity of presentation. As a result, I'm happy to increase my score from 2 to 3. The reason why I haven't gone further is what I see as an important open question around the importance of test-time stochasticity. This is an issue the authors and I discuss at length in the comments below, culminating in some early empirical evidence that the current strategy of retaining stochasticity is actually detrimental. This suggests that the method may need to be slightly revised, which I believe should be within the scope of camera-ready edits.
This work proposes a novel method for intervening on the activations of language models to improve their performance on specific tasks. Rather than parameterising the intervention function as a deterministic linear model, as has been done before, the authors use the reparameterisation trick to learn Gaussian-distributed stochastic interventions. Experiments suggest that replacing deterministic interventions with stochastic ones, specifically at early model layers, substantially improves task performance.
Questions for Authors
- Can you elaborate on your claim that "The objective of any intervention can be formalized as finding an optimal transformation that maximizes the mutual information between the modified representations and the desired output "? See my questions about this in the "Theoretical Claims" section.
- Can you clearly describe the process by which your intervention function is trained, starting with the training objective? A pseudocode algorithm might be useful here.
Claims and Evidence
The essential claim is that the proposed method improves task performance better than the existing baselines. On the merits of the main results tables (Table 2 and Table 3) alone, it seems very strong in this respect. However, it is difficult to assess these results in isolation, because I remain uncertain about some aspects of the method itself. See below...
Methods and Evaluation Criteria
MAIN CONCERN: I found the method description rather incomplete and difficult to follow. The issues start with the first mention of the term "nodes" (line 58). What network do these nodes belong to? This might be obvious to the authors, but it is not standard terminology in the steering / representation fine-tuning literature. The biggest issue I have is the lack of a clear statement of the objective and algorithm used to train your intervention function. Section 3.3 suggests that the objective is something to do with the mutual information with some desired output, but it's unclear to me how this is computed or used in practice.
If the method description were to be substantially restructured and rewritten, I believe this could become a good paper, as it seems your results are very strong.
Theoretical Claims
One important theoretical claim is that "the objective of any intervention" is to maximise "the mutual information between modified representations and the desired output ", and that doing so is bound to "[improve] the model's predictive performance" (bottom of page 3, right-hand column). I have never encountered a claim of this kind before, and you provide no citations or theoretical analysis to back it up. Furthermore, it is unclear why this point is being made. How does the method make use of it? Is Equation (2) used as your objective when training the stochastic intervention function? If so, how is the mutual information computed, given that activations are high-dimensional vectors?
Separately, I feel the paper is missing a deeper theoretical (or at least intuitive) justification of why we should expect stochastic intervention functions to be so much better than deterministic ones. Is it that (as you allude to in line 90) the stochastic approach provides a "broader exploration [of interventions] during training"? If so:
- This needs to be emphasised more throughout the paper, as it appears to be a critical point.
- Why do you still need to keep stochasticity once the intervention function has already been trained? If the benefit is to facilitate better exploration and learning during training, why not just deterministically choose the mean intervention at test-time?
Experimental Design and Analyses
This is the strongest aspect of the paper. The experiments are exhaustive, covering a range of datasets, baselines and ablations.
However, the results figures and tables are missing the performance of the base (non-intervened) models. I feel it would be very useful to include this, so the reader can understand how much benefit the various methods are giving.
Supplementary Material
Appendix reviewed but Supplementary Material not reviewed.
Relation to Broader Literature
This paper lies within the growing literature on test-time language model alignment, more specifically the steering / representation engineering literature that involves intervening on model activations to induce desired behaviour. To my knowledge, the proposed idea of introducing stochasticity into these interventions is novel.
Essential References Not Discussed
None.
Other Strengths and Weaknesses
None; I've mentioned everything I feel is important elsewhere.
Other Comments or Suggestions
I suggest the authors add the performance of the non-intervened base models to all results figures and tables.
Other terminology and presentation points:
- I feel the terms "point-wise control" and "distribution-wise control" are poorly chosen, and added to my initial confusion about the method. They suggest that your method involves intervening on distributions of activations rather than one activation at a time, but this isn't the case. From my understanding, interventions are still made on a per-activation basis, but using a stochastic intervention function rather than a deterministic one. I therefore suggest that the terms "deterministic" and "stochastic" are much clearer, and should be used instead.
- There's a lot of vague language in the introduction (e.g. "low-level" and "high-level" control, "deeper, more abstract level", "modify model behavior in a finer-grained manner", "concept space"). Without definitions, this section all reads as quite imprecise.
- The use of the term "optimal" to describe your method (e.g. in the abstract) is too strong; your results look good, but there's no theory suggesting the method is optimal.
- I don't really understand why you've chosen to put a few (seemingly random) paragraphs in grey callout boxes.
Thank you for your insightful feedback on our submission! We’re encouraged by your recognition of our experimental results and value your suggestions for improving clarity and rigor. Below, we address your concerns and outline our revision plan.
About Main Concern: Method Clarity & Training Process
- Clarification: The MI formulation in Section 3.3 (Eq. 2) was intended as a high-level conceptual framing to motivate why interventions are useful in general (i.e., they aim to preserve/enhance task-relevant information) and why early layers matter (via the data-processing inequality; see Appendix A). It is not the objective function directly optimized during training. We agree with you that this can lead to confusion and will revise it in the camera-ready.
- Actual Training Objective: Our method is trained end-to-end by minimizing the standard next-token prediction cross-entropy loss of the entire system (frozen base LM + trainable stochastic intervention layers between transformer blocks), identical to previous methods like RED/LoFiT/ReFT. Gradients flow from the loss back through the reparameterized stochastic intervention networks (see the pseudocode below).
Pseudocode algorithm for the learnable intervention:

```python
# Training loop sketch: the base LM is frozen; only the intervention
# networks (Net_mu, Net_sigma) receive gradient updates.
for X_input, Y_target in training_batches:
    # --- Forward pass ---
    # 1. Forward pass through the frozen LM up to layer l
    Z = LM_pre(X_input)                    # activations at layer l
    # 2. Predict the intervention distribution parameters
    mu = Net_mu(Z)
    sigma = softplus(Net_sigma(Z))         # ensure sigma > 0
    # 3. Stochastic intervention via the reparameterization trick
    epsilon = sample_gaussian_noise(shape=mu.shape)  # epsilon ~ N(0, I)
    Z_intervened = mu + sigma * epsilon    # differentiable sampling
    # 4. Forward pass through the remaining frozen layers
    logits = LM_post(Z_intervened)
    # --- Loss & backward pass ---
    # 5. Standard next-token cross-entropy loss; backpropagate through the
    #    reparameterized sample and update only the intervention networks
    loss = cross_entropy(logits, Y_target)
    loss.backward()
    optimizer.step()                       # optimizer holds only Net_mu / Net_sigma
    optimizer.zero_grad()
```
We sincerely apologize for the lack of clarity in the method description and training process. This is a critical point, and we will revise the relevant sections (primarily Sections 3 and 4) to provide a much clearer explanation. We will make the following changes accordingly:
- Explicitly State Training Objective. We will dedicate space in Section 4 (Methodology) to explicitly state the training objective. We will clearly define the overall system architecture (frozen LM + trainable intervention networks) and specify which parameters are updated during training.
- Add Reference with MI Discussion. We will relocate the MI-based motivation to a distinct subsection, referencing its grounding in variational information bottleneck (VIB) theory [1][2]. This will be clearly separated from the training objective to enhance readability and avoid confusion; a possible rendering of the framing is sketched below.
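As a hedged sketch in our notation (the exact form of Eq. (2) in the paper may differ), the conceptual framing amounts to choosing an intervention $\phi$ on activations $Z$ that maximizes mutual information with the desired output $Y$:

$$\phi^{\ast} = \arg\max_{\phi} \; I\bigl(\phi(Z);\, Y\bigr)$$

Under the data-processing inequality, information about $Y$ can only decrease through subsequent layers, which is the intuition behind favoring early-layer interventions.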
About Theoretical Claims & Justification for Stochasticity
- Key Motivation. Previous research found that the effect of an intervention can be controlled by multiplying it by a constant to modulate its magnitude (more positive or less positive). So why not directly learn a distribution to represent that region? Consequently, we use stochastic interventions to reflect this idea and find them very helpful.
- Key Benefits. The reviewer’s intuition is correct – a primary benefit is enhanced exploration of the intervention space during training. By sampling interventions from a learned distribution instead of applying a single deterministic transformation, the model is exposed to a wider range of related intervention effects.
- Potential Benefits of Test-Time Stochasticity. Test-time stochasticity might offer robustness to slight variations in input representations (see Figure 6, where D-ReFT is much more robust) or act as a cheap form of ensembling if multiple forward passes were considered.
About Terminology and Presentation
Thank you for your valuable feedback! We generally agree with the reviewer and will make the following changes:
- We will replace “point-wise/distribution-wise intervention” with “deterministic/stochastic intervention”.
- We will add non-intervened base model performance for Table 2 and Table 3.
- We will add concrete examples to illustrate “low/high-level control” and include a sentence illustrating “concept space.”
- We will replace “optimal” and similar descriptive language with more measured wording to maintain scientific neutrality.
Overall, we are sincerely grateful for the reviewer’s insightful feedback, which helps us improve our clarity. We believe that the updated presentation of the methodology, combined with the original experimental results, will strengthen this work and contribute meaningfully to the intervention community.
References
- [1] Tishby, Naftali, and Noga Zaslavsky. "Deep learning and the information bottleneck principle." 2015 IEEE Information Theory Workshop (ITW). IEEE, 2015.
- [2] Alemi, Alexander A., et al. "Deep variational information bottleneck." International Conference on Learning Representations, 2017.
About Main Concern: Method Clarity & Training Process
Thank you for all your effort on this point; this has substantially improved my understanding of your method. Including all of this information in the paper will greatly improve it. I personally don't see the need to retain any of the MI discussion, but this perspective may appeal more to other readers.
About Theoretical Claims & Justification for Stochasticity
It seems we're in agreement that the clearest benefit of stochasticity will come at training-time. I also think your suggestion that D-ReFT could enable a cheap form of ensembling makes sense, although as far as I can tell, you don't explore this in the paper. We're left with your hypothesis that test-time stochasticity "might offer robustness to slight variations in input representations", and to be honest I don't quite follow this. How would adding noise at test-time improve robustness? Have you done any experiments where you disabled the stochasticity at test-time, and if so, did this actually give worse results?
About Terminology and Presentation
Thanks for acknowledging all these suggestions; I believe these targeted terminology changes will help a lot.
IMPORTANT: TYPO IN TITLE
I just spotted this; the word "Language" is misspelt in the title (both on OpenReview and in the PDF)! You can thank me later ;)
I'll wait until I get a response from you on the test-time stochasticity discussion before updating my review and making my final assessment.
Thank you for catching the typo! We noticed it right after submission - hopefully we can get a chance to fix it :)
Further discussion on Test-time Stochasticity
Previously, we justified the potential benefits of test-time stochasticity because we took it for granted that, since we learn from sampling, we should also run inference with sampling. Prompted by your follow-up, we became curious about gathering more experimental evidence, so we ran a small set of additional experiments without test-time stochasticity. Generally, we find that removing test-time stochasticity can boost gains on the robustness eval and even in the math setting for Llama-3-8B. Though more experiments are needed to fully back this up as a scientific claim, we can still conclude that the need for test-time stochasticity should be questioned further. Thank you for raising such a good point!
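For clarity, the variant without test-time stochasticity can be sketched as follows (an illustrative snippet, not our exact implementation; `mu_net` and `sigma_net` stand for the learned intervention networks):

```python
import torch
import torch.nn.functional as F

def intervene(z, mu_net, sigma_net, training: bool):
    # Stochastic intervention during training; deterministic mean at test time.
    mu = mu_net(z)
    if not training:
        return mu                              # test-time: skip sampling entirely
    sigma = F.softplus(sigma_net(z))           # keep sigma > 0
    return mu + sigma * torch.randn_like(mu)   # reparameterized training sample
```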
We’re also happy to hear our first response was useful for you. A big and heartfelt thank you for your time, active engagement, and insightful feedback in this process!
The authors present a new parameter-efficient finetuning approach, D-ReFT. Whereas ReFT learns a deterministic (peculiarly parametrized) linear transformation of activations, D-ReFT instead learns a similarly peculiarly parameterized linear transformation that is stochastic. Specifically, they replace a part of ReFT with an axis-aligned normal variable whose parameters are learned with the reparametrization trick. The authors find that D-ReFT tends to outperform ReFT when applied to early layers, and thus a mixture of D-ReFT and ReFT, where the first 25% of layers use D-ReFT and the rest use ReFT, performed the best when finetuned on commonsense reasoning and arithmetic benchmarks.
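(For concreteness, the parameterization described in this summary might look roughly like the following PyTorch sketch; the class and layer names are hypothetical, not taken from the paper's code.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticIntervention(nn.Module):
    """Axis-aligned Gaussian edit learned with the reparameterization trick."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mu_net = nn.Linear(hidden_dim, hidden_dim)     # mean of the edit
        self.sigma_net = nn.Linear(hidden_dim, hidden_dim)  # per-dimension scale

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        mu = self.mu_net(z)
        sigma = F.softplus(self.sigma_net(z))   # axis-aligned (diagonal) std > 0
        eps = torch.randn_like(mu)              # eps ~ N(0, I)
        return mu + sigma * eps                 # differentiable Gaussian sample
```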
Update after rebuttal
I had concerns about the rigor of the experiments and found that the paper lacked key details about their methods. The authors provided more evidence and explained their methods in the rebuttal. I therefore updated my recommendation from 1 to 3.
Questions for Authors
None
Claims and Evidence
I suggest that the authors include error estimates in all of their results. Particularly when the improvements are so small, and when there are a great number of hyperparameters that can be tuned, it's hard to evaluate the soundness of the experiments.
I am confused by the setting in 6.2. It's unclear what 'varying epsilon from 0 to 3.0' means, since epsilon is a random variable. I suppose the authors meant the standard deviation of epsilon. But that would also make little sense, because, since sigma is a learned variable, surely it can learn to rescale itself to compensate for changes in epsilon.
I am also finding it hard to interpret Figure 4. The authors claim that accuracy and standard deviation are correlated, and choose to show this by plotting the pdfs of three different Gaussians and their corresponding accuracies. Wouldn't the standard thing to do be to simply plot one graph of accuracy and std dev?
Methods and Evaluation Criteria
I'd like to see the instruction tuning evals as well. The reason is that instruction tuning, or at least tuning for tone/style, is a more likely use case for PEFT. The ReFT paper also did instruction tuning evals, so you should be able to compare with their numbers easily.
Theoretical Claims
N/A
Experimental Design and Analyses
Lack of error bars, as mentioned earlier.
Supplementary Material
No.
Relation to Broader Literature
PEFT is commonly used for customization and for research.
Essential References Not Discussed
None.
Other Strengths and Weaknesses
Strengths:
- adding noise in training is an under-explored area for language models
Weaknesses:
- Results show slight improvements over baselines and don't have error bars to interpret the validity of the improvements
- Some parts of the paper are unclear
- 4.3: Are you referring to element-wise min/max? If so, why are you using element-wise min/max instead of something more natural like a matrix norm?
- entirety of 6.2 is confusing
Other Comments or Suggestions
pyvene is misspelled as pyenve in sec. 5.
We appreciate your thoughtful and constructive review of our manuscript. We’ve noted your main concerns regarding the statistical significance of our results and the need for clarification on our ablation choices. We address your feedback below:
About eval setting & statistical significance
To begin, we’d like to clarify that our evaluation setup generally follows what has already been established in prior work such as RED, LoFiT, and ReFT. Consistent with the previous literature, we report the average score from three independent runs. We also agree with the reviewer that including additional metrics like standard deviation would be valuable; please refer to the table below for general results.
| Model | PEFT | Params (%) | Commonsense Avg. ↑ | Math Avg. ↑ |
|---|---|---|---|---|
| ChatGPT | --- | --- | 77.0 | 61.3 |
| LLaMA-7B | ReFT | 0.031% | | |
| LLaMA-7B | D-ReFT (Ours) | 0.046% | | |
| LLaMA-13B | ReFT | 0.025% | | |
| LLaMA-13B | D-ReFT (Ours) | 0.037% | | |
| Llama-2 7B | ReFT | 0.031% | | |
| Llama-2 7B | D-ReFT (Ours) | 0.046% | | |
| Llama-3 8B | ReFT | 0.026% | | |
| Llama-3 8B | D-ReFT (Ours) | 0.039% | | |
With the inclusion of variance metrics, our D-ReFT method demonstrates consistent and statistically meaningful improvements over the ReFT baseline across all tested models (please also refer to Reviewer dkHP's rebuttal for additional results on Qwen/Gemma models). We’ll expand on the variance details in the revised paper to validate this point.
To further bolster the reviewer’s confidence in our results, we’d like to highlight a few additional points:
- All reported numbers for prior work are fully optimized. We meticulously tuned hyperparameters for all baseline methods to ensure a fair comparison (see Appendix B for details).
- For methods where replication proved challenging, we opted to directly cite the numbers reported in their original papers.
- Unlike some prior approaches that tuned parameters on the test set, we adhered to a separate dev-set hyperparameter tuning process to maintain rigor.
We’re happy to provide further details if the reviewers feel additional transparency would strengthen the manuscript.
About ablation with epsilon
We acknowledge that the phrasing in our manuscript might have caused some confusion for the reviewer. To clarify, for each scaling factor $\alpha$, we effectively sample $\epsilon$ from $\mathcal{N}(0, \alpha^2 I)$, with the variance scaled accordingly. When the scaling factor is 0, this collapses to the original deterministic ReFT.
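A minimal sketch of this ablation (illustrative; we write the scaling factor as `alpha`):

```python
import torch

def scaled_intervention(mu, sigma, alpha: float):
    # Scaling standard-normal noise by alpha is equivalent to sampling
    # epsilon ~ N(0, alpha^2 I); alpha = 0 recovers the deterministic edit (ReFT).
    epsilon = alpha * torch.randn_like(mu)
    return mu + sigma * epsilon
```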
The reviewer suggests that different scaling factors (or initializations) would ultimately result in the same learned variance. We respectfully disagree. Prior work on intervention methods like ReFT demonstrates that different initializations (LoReFT vs. DiReFT) yield markedly distinct outcomes (a different initialization does not simply rescale itself away). Our experiments also show that introducing varying degrees of randomness at the initial learning stage leads to very different optimization trajectories.
On the presentation front, we appreciate the reviewer’s feedback regarding Figure 4. Initially, we did present the results as described (in a single figure with standard deviations). However, upon closer analysis, we found an interesting correlation between learned variance and accuracy. To emphasize this discovery, we split the data into two figures. We welcome your thoughts on this presentation choice and are open to reverting to the original single-figure format if preferred.
About results on instruction tuning
To address your inquiry about instruction tuning, we provide the following updated results.
| Model & PEFT | Params (%) | Win-rate (↑) |
|---|---|---|
| GPT-3.5 Turbo 1106* | — | |
| Llama-2 Chat 7B* | — | |
| Llama-2 7B & FT* | 100% | |
| Llama-2 7B & LoRA | 0.1245% | |
| Llama-2 7B & RED | 0.0039% | |
| Llama-2 7B & ReFT | 0.0039% | 85.27 |
| Llama-2 7B & D-ReFT (Ours) | 0.0058% | 87.19 |
*Numbers taken from the ReFT paper. Three separate runs for each method.
Our proposed D-ReFT method achieves strong performance with a win-rate of 87.19%, surpassing ReFT (85.27%) and other baselines. This superior performance demonstrates that controlled stochasticity during optimization leads to better generalization on instruction-following tasks.
Thank you for catching the typo. We will correct it and improve the paper writing thoroughly.
Thanks for adding the error estimates. The error estimates are enough for me to increase my score to a 2.
The main things preventing me from increasing my score further are:
- Details with instruction tuning evaluations. Can you provide more details about your instruction tuning results? What model are you comparing the win-rate against?
- Clarifying the questions I raised in my initial review, specifically:
- 4.3: Are you referring to element-wise min/max? If so, why are you using element-wise min/max instead of something more natural like a matrix norm?
- entirety of 6.2 is confusing
Thank you for your prompt response and kind follow-up! We aimed to address your primary concerns in the first-round rebuttal; in this round, we’re happy to disclose further details to address your questions.
Detailed Setup for Instruction Tuning
We use Alpaca-Eval v1.0 for the instruction tuning evaluation. By default, version 1.0 calculates the win rate against text-davinci-003, with GPT-4 serving as the judge. The prompt template is provided by Alpaca-Eval, and all models in the Alpaca-Eval benchmark use this template for evaluation. For training, we use UltraFeedback, a high-quality instruction-tuning dataset that covers various aspects such as instruction-following, truthfulness, honesty, and helpfulness. This setup aligns with the previous work on RED and ReFT.
We adopt the recommended hyperparameter settings from the respective papers for baseline methods like ReFT. For D-ReFT, we did not have time to search for the best hyperparameters, so we directly applied the settings used for the arithmetic reasoning datasets. All results are reported over three separate runs.
Model Clamping
Yes, your understanding is correct - it's an element-wise clamp. Sadly, it's more of a historical artifact, as variational methods commonly use element-wise clamping to avoid numerical issues. We agree with you that using a matrix norm would be more convenient. We ran some preliminary tests and found that the two perform similarly, so we’ll adopt this approach in our codebase moving forward. We also think it’d be great if future work could explore learning a multivariate distribution with an L-norm clamp. Thanks!
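For illustration, the two options can be sketched as follows (hypothetical bounds and helper names, not our exact codebase):

```python
import torch

def clamp_elementwise(sigma, lo=1e-4, hi=10.0):
    # Element-wise clamp, as in common variational implementations,
    # keeping each per-dimension standard deviation in a safe range.
    return torch.clamp(sigma, min=lo, max=hi)

def clamp_by_norm(sigma, max_norm=10.0):
    # Norm-based alternative discussed above: rescale the whole vector
    # only when its overall magnitude exceeds the threshold.
    norm = sigma.norm()
    return sigma * (max_norm / norm) if norm > max_norm else sigma
```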
Clarification on Section 6.2 (scaling of ϵ)
We kindly bring your attention to our first-round rebuttal, which answers this question in the section "About ablation with epsilon." Here, we provide more detail.
Our goal was to dig deeper into where the gains come from. By ablating the scaling factor $\alpha$, we effectively sample $\epsilon$ from $\mathcal{N}(0, \alpha^2 I)$, which introduces scaled variance at the initial stage. Different values of $\alpha$, representing varying levels of randomness, lead to different outcomes (please see the first-round rebuttal for our discussion of differing views about initialization).
We designed this experiment with two questions in mind: i) What happens when we introduce different levels of randomness? ii) Where do the gains come from; could it be from effectively exploring the neighborhood region (i.e., is there a correlation between learned variance and performance)? Our findings are: i) a standard Gaussian seems sufficient; ii) possibly, as we observed some correlation between learned variance and accuracy.
Summary
Overall, we really appreciate your active engagement during the rebuttal period and the positive interaction. We hope this explanation sheds more light on our work, motivations, and design choices. Thank you for taking the time in this process!
This work deals with expanding point-wise, representation-engineering-based interventions (ReFT) to distribution-based ones (D-ReFT) by changing deterministic standard MLP layers to stochastic ones via a reparametrization of the layer into two layers (one for the mean and the other for the variance, plus Gaussian noise, of a distribution). The paper also studies the accuracy of ReFT/D-ReFT at the layer level (over all layers of Llama-based models) to show that early layers are more effective than later ones, and that using D-ReFT gives gains of around 4% on average over the pointwise ReFT. They then show over commonsense and arithmetic benchmarks that using a mixed intervention approach (the first 25% of layers using distributional and the last 75% using pointwise) leads to an optimal intervention.
Questions for Authors
See claims section
Claims and Evidence
Overall the claims in the paper are well presented and backed by the evidence presented in their experiment results which makes a convincing case for the drop-in replacement use of the distributional intervention method proposed for math and common sense reasoning tasks for the Llama family of models.
A few areas of improvement would be if the authors could:
- baseline against LoFiT, especially if any of those test results already exist, since the intervention setup is similar. Nit: Are Adapter^S and Adapter^P from the RED paper? If so, you may want to preface those labels with RED. Also, confidence intervals on the D-ReFT numbers in Table 1, Table 2, and Table 3 should be included.
- show results on “simple tasks” on top of the math/commonsense reasoning (which were numerous and well done). A point made in the paper is that the community needs to move past simple tasks, so presumably D-ReFT should make gains on such tasks as well, but at the moment the claims in the paper can only be made for math and commonsense reasoning tasks.
- spend more time explaining ReFT, since much of the paper is based on extending that particular method to be distribution-wise. If space is an issue, I’m not sure the space for Section 3 is really needed; probably easier is to just not include the gray recap boxes. Additionally, it was not immediately self-evident to me how CE and MI are equivalent (and that could go in the appendix).
- provide code and test results. The clamping mechanism mentioned in 4.3 is not shown in 4.2, and it seems like something that should go around the current right-hand side of (7); seeing code could confirm that.
- Nit: For Figure 2, it’s a little hard to compare across the 4 graphs since the y-axis for the ReFT graph is not aligned with the other 3. Also, it’d be nice to see SE bars at each point in the graphs to get a sense of variance.
- For the 6.2 ablation, are you hard-setting ε to values in [0, 3] (increments of 0.2), or are you setting ε to be N([0, 3], I)? If the former, wouldn’t that make the experiment deterministic and not stochastic? Also, if you are showing best results around ε = 1, why not have your Gaussian be centered around 1 and not 0? I’m probably confusing something here.
- Suggestion: Instead of having a separate Figure 4, why not just have another line graph in Figure 3 showing the variance value at each value of ε?
Methods and Evaluation Criteria
The methods and evaluation criteria make sense with some small caveats.
Theoretical Claims
The theoretical claims are well introduced and backed up in the paper via empirical results and ablations.
Experimental Design and Analyses
The experimental design seems sound and valid, though it could be strengthened somewhat (adding LoFiT and standard errors, etc.; see claims section).
Supplementary Material
I went through the appendix which is helpful for giving hyper parameter values found and utilized for experiments and for understanding datasets used.
Relation to Broader Literature
The contributions of this work can have impact in the myriad of intervention based methods now present in the literature and in particular for representational engineering ones.
Essential References Not Discussed
None
Other Strengths and Weaknesses
See claims section
Other Comments or Suggestions
See claims section
We thank you for your detailed and constructive review of our manuscript and your positive assessment of our work. We are also grateful for your specific suggestions, which we believe will significantly strengthen the paper. We address the key feedback below:
About comparison with LoFiT
We agree with the reviewer that LoFiT is a significant contribution to the representation fine-tuning literature, as we highlighted in our introduction. LoFiT, ReFT, and our method share several benchmarks: SIQA, ARC-c, and BoolQ (for commonsense reasoning), along with SVAMP (for math). Drawing from existing benchmark results and our own replication efforts, LoFiT and ReFT are essentially tied across these four benchmarks (60.5 for ReFT, 60.7 for LoFiT, and 62.3 for D-ReFT). As for the remaining 11 benchmarks, we’re actively addressing them and plan to incorporate the results into the main body of our work to further solidify its standing in the intervention literature.
About testing on simple tasks
Thank you for your suggestion. Given time constraints, it may be challenging for us to run a fresh batch of experiments on these tasks. However, recent literature such as RAVEL [1] and AxBench [2] reports scores for learnable intervention methods on simple tasks like entity recognition and concept detection. These results show that learnable interventions, such as ReFT, significantly outperform traditional methods like DiffMean [3] on these simple tasks, achieving over 90% accuracy in some settings. As a result, we may expect the intervention research community to shift toward tackling more complex and challenging tasks.
About the standard deviation
In this paper, we follow the evaluation settings of ReFT, LoFiT, and RED to report the average score across three different random seeds. Generally, the std for ReFT and D-ReFT is similar (avg. std. 2.71e-3 and 2.95e-3 for ReFT and D-ReFT, respectively) across settings.
About the scaling factor of epsilon
The ablation study in Section 6.2 varies the scaling factor $\alpha$ (from 0 to 3.0 with a step of 0.2) applied to $\epsilon$, not the specific values of $\epsilon$. For each scaling factor $\alpha$, we effectively sample from $\mathcal{N}(0, \alpha^2 I)$: still centered at 0 but with scaled variance. At scaling factor 0, this reduces to the original deterministic ReFT (no stochasticity). As we increase the scaling factor, we increase the variance of the noise distribution, which affects how widely the model explores the neighborhood around the learned mean. By doing this, we maintain the stochastic nature while controlling the amount of randomness.
About the presentation & code
Thank you for providing valuable suggestions regarding paper presentation! We will add a specific section in the appendix to introduce different intervention methods like ReFT to give readers outside this field better background information. We are also working on cleaning the codebase and will release all code and model checkpoints (approximately 70 for all ablations for one model) to make our research accessible to all users.
Reference
- [1] Huang, Jing, et al. "Ravel: Evaluating interpretability methods on disentangling language model representations." Association for Computational Linguistics (2024).
- [2] Wu, Zhengxuan, et al. "AXBENCH: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders." arXiv preprint arXiv:2501.17148 (2025).
- [3] Marks, Samuel, and Max Tegmark. "The geometry of truth: Emergent linear structure in large language model representations of true/false datasets." Conference on Language Modeling (2024).
The authors suggest a generic methodology to replace deterministic interventions with distribution-level ones. Commonsense and arithmetic reasoning benchmarks on different Llama models are employed. When their method is used on early layers, the performance on tested tasks improves. When distribution-level intervention is applied to all layers, the performance degrades.
Update after rebuttal
I think the authors did a good job of engaging with the rebuttal/critiques to improve the paper and were generous in providing new experiments/results. However, my remaining concern is the lack of commentary on the high standard deviations and the missing information regarding error bars/statistical significance (which was really my key concern). Otherwise, I would've raised my score to 4, but I now keep it at 3.
Questions for Authors
- A major question: to what extent is this "mixed-strategy-is-better" finding relevant to other language model families? Experiments on other decoder-only families, like Qwen and Gemma, would be valuable to see whether your findings still hold.
- What's the motivation for clamping in 4.3? Please discuss the limitations of such an approach.
- Are the results in Figure 2, and Tables 1 and 3 statistically significant? Can you include the standard deviation or the standard error of the mean?
- Great use of datasets, but why aren't all methods benchmarked for arithmetic tasks (as in commonsense reasoning)?
- The robustness test - why would the "deletion of words" be a relevant evaluation strategy for testing the methods? A single token can hold crucial information in a particular prompt context, especially for arithmetic tasks. Please motivate this choice of evaluation strategy. (Can synonym replacement or any other "semantically similar replacement strategy" provide more reliable outcomes?)
- Your proposed methodology is presented as a generic method (D-MLP ... D-ReFT), so why are only D-ReFT benchmarking results reported in the tables?
Claims and Evidence
The paper is generally clear and ...
- The idea of replacing deterministic nodes with probabilistic ones is theoretically valid/ bringing uncertainty to interventions seems useful and complementary to existing efforts in the field
- The notation and theory in Sections 3 and 4 are generally well-written and easy to follow
- The experiments are substantial (number of datasets/ tasks) and generally well-described
... when it comes to evidence, I have a couple of remarks
- The paper extends its claims to "language models", but runs experiments on only one model family. Either additional model families (e.g., Gemma, Qwen) could be evaluated, or claims should be scoped/lowered to Llama models
- The proposed method appears very sensitive to layer choice (i.e., where to apply the distribution-level intervention) but this is not communicated properly — the introduction highlights best-case gains (+4% to +6%) but does not sufficiently communicate sensitivity to layer choice (by looking at Figure 2, model degradation is likely!)
- The proposed method introduces many hyperparameters (Section 6.3), which is a practical drawback for anyone using the method. The authors promise "detailed values" about these hyperparameters in Appendix B, yet many formulations in Appendix B are limited to phrasings like “works best” (see B.1 and B.2), which should be improved so the reader can better understand how hard it would be for a new user/practitioner to apply the method to their use case
- Phrasings such as "significantly higher" and "significantly improve" are used across the paper, but no statistical significance analysis, error bars, or confidence intervals support this
Methods and Evaluation Criteria
- In the formulation "all test samples" (p.6), it is unclear which dataset is referred to
Otherwise, see "Claims and Evidence" response.
Theoretical Claims
Yes, Sections 3 and 4 are clear and seem correct.
Experimental Design and Analyses
See above.
Supplementary Material
Yes, I reviewed most sections in the supplementary material.
Relation to Broader Literature
Yes, the authors provide a satisfactory Related Works section.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
The paper introduces an interesting and theoretically valid intervention strategy, so I would ideally like to accept the paper, but currently, I am borderline (weak accept). Some empirical claims seem a bit overstated — so either lower them or return additional empirical evidence in the rebuttal (as asked for in the different questions above and below). Also, a more clear motivation for certain design choices (such as clamping, and the use of distribution-level only on lower layers + if this translates to other model families) would strengthen the paper.
Other Comments or Suggestions
- Consider moving interpretation insights from Section 7.1 into the introduction (currently, the mixed strategy is introduced in the Introduction without sufficient context)
- The term "robustness" in the abstract and introduction is unclear what it is referring to
- Figure 4 presents correlation information in a non-standard way — while it states variables are correlated, the distribution is centered around zero?
- In Section 7, percentage notation is confusing, explicit layer indices would remove ambiguity and make it easier to read
Thank you for your constructive feedback! We sincerely appreciate the time and effort you dedicated to reviewing our paper and providing thoughtful comments. We have carefully reviewed your concerns and questions and recognize that your primary focus is generalization to other model families and the motivation of the experimental design. Below, we address each point in detail.
About model diversity
We appreciate the feedback regarding our claims' generalizability beyond Llama models. Previously, we followed the setting in ReFT for better comparison, but we recognize the limitations of this design. In response, we have conducted additional experiments on Qwen and Gemma models on arithmetic reasoning tasks.
For the layer-wise setting:
Qwen-2.5-7B (28 layers)
| Intervention | | | | | |
|---|---|---|---|---|---|
| ReFT | 79.2 | 78.5 | 78.1 | 75.7 | 74.6 |
| D-ReFT (Ours) | 80.4 | 80.1 | 78.3 | 76.5 | 73.7 |
Gemma-3-12B (42 layers)
| Intervention | | | | | |
|---|---|---|---|---|---|
| ReFT | 80.7 | 82.4 | 78.7 | 78.2 | 76.6 |
| D-ReFT (Ours) | 83.4 | 83.0 | 80.1 | 79.2 | 76.7 |
For the all-layer setting:
| Model | ReFT | D-ReFT (25%) | D-ReFT (50%) | D-ReFT (75%) | D-ReFT (100%) |
|---|---|---|---|---|---|
| Qwen-2.5-7B | 85.7 | 87.1 | 86.2 | 85.7 | 85.1 |
| Gemma-3-12B | 88.2 | 90.6 | 89.6 | 87.1 | 86.1 |
(For Qwen-2.5-7B, this corresponds to the top 8 layers; for Gemma-3-12B, the top 11 layers.)
These new findings back up our original observations (early layers perform better + a mixed strategy works well) and show they apply across different model types. We’ll update the main paper by adding these results to Table 2 and Table 3, strengthening our overall analysis.
About the motivation for clamping
The key motivation for implementing clamping is to prevent numerical instability issues that may arise from introducing large stochasticity during training (similar to variational methods [1][2]). Regarding potential limitations: theoretically, the latent distribution we learn will deviate slightly from the standard Gaussian due to clamping. We also find that performance without clamping is slightly better, but not by much, as extreme values occur very rarely during sampling.
About the setting in robustness eval
During our preliminary studies, we tried both synonym replacement (using WordNet) and paraphrase generation (using back-translation). However, our empirical analysis revealed that these semantics-preserving transformations produced insufficient perturbation magnitude to effectively discriminate between intervention methodologies, so we implemented the more challenging delete-N setting to provide a stronger attack.
About the standard deviation
Generally, the std for ReFT and D-ReFT is similar (avg. std. 2.71e-3 and 2.95e-3 for ReFT and D-ReFT, respectively) across settings.
About the writing and presentation
We acknowledge the reviewer’s valuable feedback concerning several aspects of the paper’s writing and presentation that require improvement. We plan to incorporate those suggestions and revise claims to accurately reflect the scope of experimental validation (model families tested) and ensure terms like "significant" are moderated.
References
- [1] Alemi, Alexander A., et al. "Deep variational information bottleneck." International Conference on Learning Representations, 2017.
- [2] Zhu, Zhiyu, et al. "Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability." International Conference on Learning Representations, 2025.
Thank you for providing the additional experiments!
Some follow-up questions on the rebuttal:
- Q1. In the new results table, why are certain values bold-faced? Is this to indicate best performance per row, or something else?
- Q2. Can you please summarise how delete-N works? I'm still not completely convinced about the validity of this evaluation approach.
- Q3. You provide averaged stds, but it is still unclear to me if the improvements with D-ReFT (Figure 2, and Tables 1 and 3) are statistically significant? Will the final version include standard deviations for the tables and error bars for the figures? Results from statistical significance tests?
- Q4. I'm also missing a discussion on the drawbacks of the many hyperparameters that the method needs tuned, and the general sensitivity of the method with respect to these parameters. I raised this as a question already, i.e., asking for more details in Appendix B, but I don't see a response to it in the first rebuttal. I'd like the relative difficulty for a new user/practitioner to apply your method to their own dataset to be properly communicated in the paper.
- Q5. Lastly, maybe I missed this, but I don't understand why only D-ReFT is benchmarked while the proposed methodology is presented as a generic method (D-MLP ... D-ReFT)? It is totally fine to limit the experiments at some point, but I'd like to understand why no other "D" extensions were explored?
With these answered, I'd be happy to raise my score!
Thank you for your kind feedback and active follow-up! In the previous round, it was hard to cover all the questions and details due to space constraints. In this response, we're happy to provide further discussion.
Q1. In the new results table, why are certain values bold-faced? Is this to indicate best performance per row, or something else?
Yes, your understanding is correct: we bold the numbers with the best performance in terms of layer choice and mixed-strategy choice. The message we want to convey is that the previous insights still hold across different architectures.
Q2. Can you please summarise how delete-N works? I'm still not completely convinced about the validity of this evaluation approach.
Thank you for raising this point. We believe the reviewer already grasps the basic setup of delete-N, so we'll focus on why it works, particularly addressing the concern that a single token can carry critical information. In our arithmetic benchmark, the average sequence length is around 40-50 tokens. When we delete only a small portion (fewer than 5 tokens), the setup resembles something like: "A [MASK] building needed 12 windows. The builder had already [MASK] 6 of them. If it takes 4 hours [MASK] install [MASK] window, how long will [MASK] take him to install the rest?" Also, in practice, we do not delete any numeric tokens, so the scenario mentioned by the reviewer should be mitigated. A minimal sketch of the procedure follows.
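For concreteness, here is a sketch of the delete-N perturbation (an illustrative helper; our actual implementation may differ in tokenization details):

```python
import random

def delete_n(tokens, n=5, seed=0):
    # Randomly delete n tokens, never removing tokens that contain digits,
    # so the numeric content of arithmetic problems stays intact.
    rng = random.Random(seed)
    candidates = [i for i, t in enumerate(tokens) if not any(c.isdigit() for c in t)]
    drop = set(rng.sample(candidates, min(n, len(candidates))))
    return [t for i, t in enumerate(tokens) if i not in drop]
```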
To further strengthen this evaluation, one option could be to include additional perturbation results, such as synonym replacement or paraphrasing, in this section. We hope this clarification addresses your concern effectively.
Q3. You provide averaged stds, but it is still unclear to me if the improvements with D-ReFT (Figure 2, and Tables 1 and 3) are statistically significant? ...
We’ll start by providing more details for Tables 1 and 3 here:
| Model | PEFT | Params (%) | Commonsense Avg. ↑ | Math Avg. ↑ |
|---|---|---|---|---|
| ChatGPT | --- | --- | | |
| LLaMA-7B | ReFT | 0.031% | | |
| LLaMA-7B | D-ReFT (Ours) | 0.046% | | |
| LLaMA-13B | ReFT | 0.025% | | |
| LLaMA-13B | D-ReFT (Ours) | 0.037% | | |
| Llama-2 7B | ReFT | 0.031% | | |
| Llama-2 7B | D-ReFT (Ours) | 0.046% | | |
| Llama-3 8B | ReFT | 0.026% | | |
| Llama-3 8B | D-ReFT (Ours) | 0.039% | | |
In the previous manuscript, we reported only the average scores from three separate runs, following common practice in the literature. However, we recognize that including statistical significance results would be highly beneficial. To address your questions directly: yes, we will include all the details in the final version. The figures will feature error bars, and the tables will provide standard deviations along with all runs.
Q4. I'm also missing a discussion on the drawbacks of the many hyperparameter that the method needs tuning and the general sensitivity of the method, with respect to these parameters... (parameter difficulty)
As the D-intervention is indeed a generic method (as the reviewer notes in Q5), the additional hyperparameters boil down to just the choice of the scaling factor for $\epsilon$, and we've conducted a thorough ablation study on this. Practitioners can simply use a standard Gaussian without needing further tuning. For those looking to apply D-ReFT, we suggest sticking with our recommended settings in Appendix B.
Q5. Lastly, maybe I missed this, but I don't understand why only D-ReFT is benchmarked while the proposed methodology is presented as a generic method (D-MLP ... D-ReFT)? ...
Yes! We believe this generic framework is exactly what makes its future potential exciting. As for the choice in this work, the reasoning is fairly straightforward: we wanted to explore how far we could push the approach by applying it to a powerful deterministic version. Since ReFT is the strongest deterministic intervention method to date, D-ReFT represents the best intervention we could achieve currently.
Thank you once again for dedicating your time to this process! This type of generous and genuine discussion is undoubtedly what every author looks for. While we may not have the opportunity for further back-and-forth, we hope this exchange has provided the reviewer with better clarity and strengthened our work!
This work proposes and evaluates a (standard reparameterization-based) method as an alternative to more typical methods for steering language model representations. The experimental evaluations are thorough, especially after the revisions, and the paper is clearly presented. Given the growing interest in this topic, this work seems potentially impactful. While there are some lingering questions (e.g. from reviewer DgJz about the test-time stochasticity), they should be addressable in the final version.