PaperHub
Overall rating: 5.5 / 10
Poster · 4 reviewers
Ratings: lowest 2, highest 4, standard deviation 0.7
Individual ratings: 2, 3, 3, 4
ICML 2025

Joint Localization and Activation Editing for Low-Resource Fine-Tuning

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

Activation editing approach that jointly optimizes the selection of intervention components and the intervention strategy in low-resource settings

Abstract

Keywords
activation editing, low-resource, localization

Reviews and Discussion

Official Review (Rating: 2)

The paper proposes JOLA, a novel parameter-efficient fine-tuning (PEFT) method that dynamically selects and edits the outputs of specific Transformer attention heads. JOLA jointly learns: (1) which heads to target, (2) the intervention type (additive, multiplicative, or both), and (3) the corresponding parameters (additive offsets/multiplicative scalings). Evaluated on three NLP tasks—commonsense reasoning, language understanding, and generation—JOLA outperforms baselines under low-resource settings, demonstrating improved stability and efficiency through its unified framework of localization and modular activation editing.

Questions for the Authors

  1. It is worth exploring why JoLA incorporates two additional gates while maintaining the same overall number of parameters as LOFIT. A thorough explanation of this architectural choice would enhance the reader's understanding of how JoLA optimally manages its parameters.
  2. In Figure 2, the performance of editing attention modules on the Physics task appears to be the lowest. An analysis of this outcome is warranted.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

No theoretical claims.

Experimental Design and Analyses

Yes, no issues.

Supplementary Material

Yes, all parts.

Relation to Prior Work

While the paper introduces JOLA as a novel PEFT method for joint localization and activation editing, there is insufficient clarity in distinguishing it from LOFIT, a closely related prior work that also intervenes on specific Transformer heads with additive/multiplicative editing. Both JOLA and LOFIT aim to (1) identify a subset of heads for editing and (2) apply combined additive/multiplicative interventions. The paper does not clearly articulate what differentiates JOLA's joint optimization framework from LOFIT's approach, even though JOLA's reported performance is higher than LOFIT's.

Missing Essential References

No.

Other Strengths and Weaknesses

Strengths:

  1. The article has a clear and coherent structure with well-organized content.
  2. The article provides rich visual and analytical insights. The figures and tables effectively clarify complex concepts. The visualization of gate status provides valuable empirical evidence of JOLA's mechanism.
  3. JOLA outperforms baselines across multiple NLP tasks, particularly under low-resource settings. The results are well-documented, and the performance gains are consistent, supporting the method's efficacy.

Weaknesses:

  1. In Figure 1, a red-box structure diagram could be added for LoFIT, matching those shown for RED and JOLA.
  2. In Figure 1, LoFIT's Step 2 should show a "+" sign, not a "×" sign.
  3. The article states, "However, the effectiveness of standard PEFT methods is limited in low-resource scenarios with only a few hundred examples." PEFT refers to a family of fine-tuning methods designed to reduce computational cost, and this passage is meant to motivate their shortcomings, yet the low-data regime is not a defect of PEFT itself. The logical connection between this claim and the text that follows is unclear.
  4. I hope the article can provide a detailed comparative analysis between JOLA and LOFIT. From the current description, LOFIT's m is learnable, but the maximum of its value range is a hyperparameter that must be set manually. Is LoFIT's m therefore the product of JoLA's g and m? If LoFIT's m is 0, does that mean the head is not selected? Similarly, is LoFIT's a the sum of JoLA's g and a? In addition, the article states that LoFIT selects heads via multiplication, but JOLA also uses multiplication to select heads; what is the specific difference here? Or are LOFIT's two stages not separated as shown in Figure 1, since an equation merging them into a single stage would also be acceptable?

Other Comments or Suggestions

  1. To enhance clarity in Figure 1, I suggest utilizing distinct colors for the boxes representing different methods. This would allow for easier differentiation and improve the overall visual appeal of the figure.
  2. In Table 2, it would be beneficial to bold the maximum value to draw attention to this important data point, thereby facilitating a quicker understanding of the results presented.
Author Response

Thank you for your insightful comments and suggestions.

Q1: Presentation in Figure 1, Table 2.

A1: Thank you for your suggestion. We will make the necessary changes in the next version.

Q2: What is the logical relationship between PEFT's limitations and the low-resource setting?

A2: In the abstract, we introduced PEFT and its limitations in low-resource scenarios. Activation editing is a form of PEFT but updates fewer parameters compared to traditional methods like Adapter and LoRA. Traditional methods are less suitable for low-resource scenarios, and our approach aims to tackle this issue. We will clarify this logic in the next version to avoid any confusion.

Q3: Provide a detailed comparative analysis between JOLA and LOFIT.

A3: Below, we provide a comparison focusing on methodological differences, intervention strategies, and empirical performance.

(1) Methodological differences: Unlike LoFiT’s rigid two-stage pipeline, JoLA unifies localization and intervention into a single end-to-end framework. This allows dynamic adaptation to task requirements, avoiding suboptimal head selection caused by decoupled optimization.

Aspect | LoFIT | JoLA
Localization | Two-stage process: (1) select heads via learning with multiplicative interventions; (2) freeze the selection and train additive interventions (bias vectors). | Joint optimization: dynamically selects heads while learning intervention parameters.
Intervention type | Additive only (bias vectors applied to selected heads). | Hybrid intervention: combines additive biases and multiplicative scaling via adaptive gating.
Sparsity control | L1 regularization on scaling factors to select the top-K heads. | Hard Concrete gates with expected-L0 regularization, enabling probabilistic head pruning during training.
Flexibility | Fixed intervention strategy (additive) after head selection. | Learns task-specific intervention types (e.g., scaling vs. bias) per head.

(2) Formula-level comparison

LoFIT: $z^{(l,i)}_t \leftarrow z^{(l,i)}_t + v^{(l,i)}$ (additive bias)

Limitation: Static additive edits may be insufficient for complex tasks that require both amplification and suppression of features.

JoLA:

$z^{(l,i)}_t \leftarrow (1 + g_m^{(l,i)} \cdot m^{(l,i)}) \odot z^{(l,i)}_t + g_a^{(l,i)} \cdot a^{(l,i)}$

where the scaling term is $(1 + g_m^{(l,i)} \cdot m^{(l,i)})$ and the bias term is $g_a^{(l,i)} \cdot a^{(l,i)}$.

Advantage: The hybrid operation enables fine-grained control over activations—scaling amplifies/suppresses existing features, while biases shift representations.
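To make the hybrid update concrete, the following is a minimal PyTorch-style sketch of the per-head edit above; the module name, tensor shapes, and the way gate values are passed in are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class JoLAInterventionSketch(nn.Module):
    """Illustrative per-layer edit: z <- (1 + g_m * m) * z + g_a * a for each head."""

    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        # Learnable multiplicative scaling m and additive bias a, one vector per head.
        self.m = nn.Parameter(torch.zeros(num_heads, head_dim))
        self.a = nn.Parameter(torch.zeros(num_heads, head_dim))

    def forward(self, z: torch.Tensor, g_m: torch.Tensor, g_a: torch.Tensor) -> torch.Tensor:
        # z: (batch, seq, num_heads, head_dim); g_m, g_a: per-head gate values in [0, 1].
        g_m = g_m.view(1, 1, -1, 1)
        g_a = g_a.view(1, 1, -1, 1)
        return (1.0 + g_m * self.m) * z + g_a * self.a


# When both gates of a head are 0, that head's output is left unchanged.
edit = JoLAInterventionSketch(num_heads=32, head_dim=128)
z = torch.randn(2, 16, 32, 128)
zeros = torch.zeros(32)
assert torch.allclose(edit(z, zeros, zeros), z)
```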

(3) Empirical Performance

  • JoLA shows robust performance across 26 tasks under low-resource settings.
  • JoLA is more parameter-efficient than LoFIT; we will update the parameter counting as discussed in Reviewer 2Bjt [Q2] in the new version of the paper.

Q4: Parameter issues

A4: As noted in the original papers for each baseline method, these methods are sensitive to hyperparameters, and we detail this in Appendix D.3. The following table shows the sensitivity of these methods, with the results highlighting the lack of robustness, which motivates our proposal for a more stable approach.

Different learning rates in RED

Learning rate | SIQA | WinoGrande | Law | Physics | e2e_nlg | Web_nlg
5e-5 | 48.47 | 48.25 | 28.00 | 21.00 | 8.27 | 12.74
2e-4 | 51.23 | 50.87 | 32.00 | 18.00 | 11.33 | 14.53
6e-2 | 50.16 | 47.14 | 29.00 | 23.00 | 10.14 | 13.59

Different prefix and suffix positions in ReFT

Positions | SIQA | WinoGrande | Law | Physics | e2e_nlg | Web_nlg
p7+s7 | 62.51 | 53.96 | 28.00 | 40.00 | 11.44 | 12.57
p11+s11 | 58.24 | 51.48 | 30.00 | 41.00 | 12.02 | 11.35

Number of attention heads in LoFIT

Heads | SIQA | WinoGrande | Law | Physics | e2e_nlg | Web_nlg
32 | 54.21 | 52.34 | 33.00 | 10.00 | 13.18 | 17.27
64 | 56.52 | 54.56 | 35.00 | 12.00 | 15.56 | 16.34
128 | 51.26 | 55.42 | 30.00 | 8.00 | 14.60 | 18.07

Q5: What happens on the physics task in Figure 2?

A5: Upon reviewing the generated text for the physics task in Figure 2, we found that in the low-resource setting, the generated outputs did not consistently follow the required format, which impacted the results. Specifically, the generated text lacked the necessary result options, leading to lower accuracy when evaluated using exact match criteria.

Official Review (Rating: 3)

This paper introduces JOLA, a novel approach to efficiently adapt large language models in low-resource settings. The method jointly learns which attention heads to modify and determines optimal activation interventions using both additive and multiplicative adjustments. By incorporating a dynamic gating mechanism based on Hard-Concrete distributions with expected-L0 regularization, JOLA selectively edits a minimal set of components, thereby significantly reducing the number of trainable parameters. Extensive experiments on tasks spanning commonsense reasoning, natural language understanding, and language generation demonstrate that JOLA consistently outperforms existing methods such as LoRA and other activation editing techniques, achieving robust and scalable improvements even with limited training data.

Questions for the Authors

  1. How does the JOLA method perform when given larger resources? Can this activation-based editing approach be applied to larger datasets?
  2. The paper shows that the activation editing method performs best when applied to the Attention layer. Is this also the case for JOLA? Could the effectiveness of JOLA on the MLP layer be verified?

Claims and Evidence

The claims proposed in the paper are substantiated by extensive experimental evidence, demonstrating that the JOLA method not only consistently outperforms baseline activation editing and parameter-efficient fine-tuning techniques in low-resource scenarios, but also significantly enhances performance across a diverse set of tasks including commonsense reasoning, natural language understanding, and generation. Moreover, the ablation studies provide clear and convincing evidence that the dynamic gating mechanism and selective attention head modifications are key factors driving these improvements, effectively validating the paper's central claims.

Methods and Evaluation Criteria

The Joint Localization and Activation Editing (JOLA) approach presented in this paper is soundly designed. The challenge of efficiently adapting large language models with scarce data stems from the difficulty in discerning which internal components yield the greatest impact. JOLA overcomes this by simultaneously determining the most influential attention mechanisms to modify and selecting the appropriate adjustment strategy—be it scaling, shifting, or a combination—while fine-tuning the corresponding parameters, thereby ensuring that the targeted modifications are optimally calibrated for low-resource environments.

Theoretical Claims

The paper does not include theoretical proofs.

Experimental Design and Analyses

The paper presents a thorough experimental design to evaluate the proposed Joint Localization and Activation Editing (JOLA) method for low-resource fine-tuning. In addition, the inclusion of extensive ablation studies (on gate mechanisms, the number of gates, and head selection strategies) and analyses across different data and model sizes strengthens the validity of their claims. The authors mention that there is a huge difference in hyperparameter selection among the baseline methods. Could the extreme value variance for JoLA and the other methods be displayed? Furthermore, providing a more detailed explanation of the statistical significance of the results and the reproducibility of the experiments would strengthen confidence in the research findings.

Supplementary Material

Yes, the supplementary materials include experimental configuration, hyperparameters, detailed experimental results for different datasets, as well as detailed examples of dataset input and output.

Relation to Prior Work

To improve low-resource adaptation of large language models, prior works have explored both parameter-efficient fine-tuning methods, such as LoRA and BitFit, and activation editing techniques like RED, REPE, and LoFIT. The JoLA method proposed in this paper innovatively combines dynamic attention head localization with joint optimization of additive and multiplicative interventions. Drawing on ideas from network pruning—specifically, HardConcrete gating and expected-L0 regularization—JoLA overcomes the static limitations of previous methods, resulting in a more adaptive and robust fine-tuning strategy that aligns well with contemporary advances in model adaptation.

Missing Essential References

na

Other Strengths and Weaknesses

Strengths:

  1. Originality: JOLA introduces an innovative approach by jointly determining which attention heads to modify and optimizing both scaling and shifting interventions, thereby creatively combining insights from activation editing and network pruning.
  2. Significance: The extensive experiments across commonsense reasoning, language understanding, and generation tasks clearly demonstrate that JOLA substantially outperforms existing fine-tuning methods in low-resource settings, underscoring its potential impact on efficient model adaptation.
  3. Clarity: The paper is well-structured and offers detailed explanations of the dynamic gating mechanism and head selection strategy, with comprehensive ablation studies that effectively illustrate the contribution of each component.

Weaknesses: The approach’s reliance on specific gating mechanisms and parameter settings may introduce additional complexity and sensitivity to hyperparameter tuning, potentially affecting reproducibility.

Other Comments or Suggestions

na

Author Response

We appreciate your comments and the opportunity to clarify and improve our work.

Q1: Large differences in hyperparameter selection among the baseline methods and JoLA; reproducibility of the proposed JoLA.

A1: As noted in the original papers for each baseline method, these methods are sensitive to hyperparameters, and we detail this in Appendix D.3. The following table shows the sensitivity of these methods, with the results highlighting the lack of robustness, which motivates our proposal for a more stable approach.

Different learning rates in RED

Learning rate | SIQA | WinoGrande | Law | Physics | e2e_nlg | Web_nlg
5e-5 | 48.47 | 48.25 | 28.00 | 21.00 | 8.27 | 12.74
2e-4 | 51.23 | 50.87 | 32.00 | 18.00 | 11.33 | 14.53
6e-2 | 50.16 | 47.14 | 29.00 | 23.00 | 10.14 | 13.59

Different prefix and suffix positions in ReFT

Positions | SIQA | WinoGrande | Law | Physics | e2e_nlg | Web_nlg
p7+s7 | 62.51 | 53.96 | 28.00 | 40.00 | 11.44 | 12.57
p11+s11 | 58.24 | 51.48 | 30.00 | 41.00 | 12.02 | 11.35

Number of attention heads in LoFIT

Heads | SIQA | WinoGrande | Law | Physics | e2e_nlg | Web_nlg
32 | 54.21 | 52.34 | 33.00 | 10.00 | 13.18 | 17.27
64 | 56.52 | 54.56 | 35.00 | 12.00 | 15.56 | 16.34
128 | 51.26 | 55.42 | 30.00 | 8.00 | 14.60 | 18.07

Q2: Providing a more detailed explanation of the statistical significance of the results.

A2: Thank you for your suggestion. We agree that conducting a statistical significance analysis is important to strengthen the credibility of our results, and we will incorporate it; note, though, that the performance gap is large. The significance test shows a meaningful difference between our method and the baseline methods.

Q3: How does the JoLA method perform when given larger resources?

A3: Thank you for your suggestion. We conducted experiments on larger datasets, and as shown in the table, our method remains effective with 5,000 and 10,000 samples. However, from 20,000 up to 100,000 samples, there is a slight gap between our method and LoRA. These differences are acceptable, as we update fewer parameters than LoRA.

Samples | SIQA (JoLA) | SIQA (LoRA) | WinoGrande (JoLA) | WinoGrande (LoRA)
1,000 | 74.77 | 70.04 | 74.30 | 71.16
2,000 | 75.24 | 73.12 | 74.92 | 71.75
3,000 | 75.43 | 73.85 | 75.37 | 72.32
5,000 | 75.88 | 74.35 | 75.91 | 73.57
8,000 | 75.91 | 75.14 | 76.20 | 74.69
10,000 | 75.96 | 75.69 | 76.31 | 75.92
20,000 | 76.02 | 76.48 | 76.55 | 76.47
30,000 | 76.08 | 76.94 | 76.68 | 77.24
50,000 | 76.15 | 77.23 | 76.84 | 78.33
80,000 | 76.21 | 77.56 | 76.92 | 78.52
100,000 | 76.26 | 77.81 | 77.04 | 78.96

Q4: Could the effectiveness of JoLA on the MLP layer be verified?

A4: We also applied the gating mechanism-based strategy to the MLP layer. As shown in the following table, our experimental results demonstrate that the JoLA method is also effective in the MLP layer. However, it empirically tends to be more effective when used in the attention layer.

Reasoning: SIQA, WinoGrande · Understanding: Law, Physics · Generation: e2e_nlg, Web_nlg

Setting | SIQA | WinoGrande | Law | Physics | e2e_nlg | Web_nlg
MLP w/o gate | 50.10 | 51.62 | 34.00 | 20.00 | 10.31 | 14.45
MLP with gate | 52.46 | 52.43 | 36.00 | 23.00 | 11.23 | 16.25
Attention w/o gate | 55.94 | 55.33 | 36.00 | 7.00 | 14.77 | 18.12
Attention with gate | 66.22 | 58.33 | 40.00 | 46.00 | 15.54 | 24.39
Attention+MLP w/o gate | 52.17 | 48.74 | 23.00 | 13.00 | 8.23 | 12.36
Attention+MLP with gate | 53.28 | 52.07 | 27.00 | 16.00 | 10.42 | 14.83
Official Review (Rating: 3)

The paper proposes JoLA, an interpretability-inspired parameter-efficient tuning method. JoLA intervenes on the attention activations with both scaling and offsetting. In addition, JoLA uses Hard Concrete gates with expected-L0 regularization to learn the localization together with the intervention in an end-to-end way.

The paper conducts evaluations on multiple datasets and mainly compares against other activation editing baselines (ReFT and LoFiT). The results suggest the proposed approach is effective.

Update after rebuttal

Thank you for your response. I've updated the score.

Questions for the Authors

See Experimental Design and Analyses.

Claims and Evidence

The claims are mostly valid. JoLA extends previous methods by jointly learning to localize and edit activations, which seems effective. The paper also provides abundant analysis.

Though I have some concerns about the way the paper sets up the comparison with the baselines, and the way it discusses the number of parameters.

Methods and Evaluation Criteria

The proposed method jointly optimizes localization and editing, which is well motivated and makes sense.

The paper also provides experiments across different model families, datasets, and model sizes.

Theoretical Claims

N/A as there aren't theoretical claims in the main paper.

Experimental Design and Analyses

First, I have some concerns about the way that the paper sets up the baselines.

The baseline results seem either under-tuned or too weak to compare against: in Table 1, many of the fine-tuning baselines underperform zero-shot (e.g., for Qwen2.5 7B on the reasoning tasks, none of the 6 baselines outperform the zero-shot baseline).

I wonder how the baseline is set up. The paper claims that baseline results were obtained by "selecting five hyperparameters and averaging the results" (ln 246). A more common approach is to report the best result out of the five hyperparameter settings as the baseline.

Second, I wonder how the number of learnable parameters is counted in Table 3. Does it count only the gating variables, or also the "activated" attention heads? If I understand correctly, JOLA incorporates two gating, one scaling, and one additive parameter set for each attention head. All of these are learned end to end and require gradient back-propagation. Are they all counted in the number of parameters?

Supplementary Material

I did not review the supplementary material.

Relation to Prior Work

The method improves on previous methods by jointly localizing and editing the activations, which is learned end-to-end with gating.

Missing Essential References

Not that I am aware of.

Other Strengths and Weaknesses

The paper provides abundant ablations on the design choice of the method.

Other Comments or Suggestions

Some figures are hard to read, e.g., Figure 4 and Figure 8.

Author Response

Thank you for taking the time to review our paper and provide valuable insights.

Q1: Baseline setup seem either under-tuned or too weak to compare with?

A1: Thank you for your feedback on our baseline system setup.

(1) As noted in the original papers for each baseline method, these methods are sensitive to hyperparameters, and we detail this in Appendix D.3. The following table shows the sensitivity of these methods, with the results highlighting the lack of robustness, which motivates our proposal for a more stable approach.
Different learning rates in RED

Learning rate | SIQA | WinoGrande | Law | Physics | e2e_nlg | Web_nlg
5e-5 | 48.47 | 48.25 | 28.00 | 21.00 | 8.27 | 12.74
2e-4 | 51.23 | 50.87 | 32.00 | 18.00 | 11.33 | 14.53
6e-2 | 50.16 | 47.14 | 29.00 | 23.00 | 10.14 | 13.59

Different prefix and suffix positions in ReFT

Positions | SIQA | WinoGrande | Law | Physics | e2e_nlg | Web_nlg
p7+s7 | 62.51 | 53.96 | 28.00 | 40.00 | 11.44 | 12.57
p11+s11 | 58.24 | 51.48 | 30.00 | 41.00 | 12.02 | 11.35

Number of attention heads in LoFIT

Heads | SIQA | WinoGrande | Law | Physics | e2e_nlg | Web_nlg
32 | 54.21 | 52.34 | 33.00 | 10.00 | 13.18 | 17.27
64 | 56.52 | 54.56 | 35.00 | 12.00 | 15.56 | 16.34
128 | 51.26 | 55.42 | 30.00 | 8.00 | 14.60 | 18.07

(2) Baseline methods require task-specific fine-tuning, and our evaluation covers 26 tasks. Fine-tuning all tasks individually is impractical due to the large hyperparameter search space. Therefore, we performed hyperparameter selection using a grid search approach. For each task, we ran a grid search with five different hyperparameter configurations, chosen to explore a diverse range of settings that could yield the best model performance. We searched over key hyperparameters (e.g., learning rate, selected heads/layers/positions, etc., as highlighted in Appendix D.3), using a validation set to select the configuration with the best performance. The results presented here correspond to the best hyperparameter configuration selected for each task based on validation performance. The final model was evaluated with these hyperparameters, and we averaged the results across all tasks. As observed, this had little effect on the relative performance of the baselines, with our method continuing to outperform the others.

Table: best hyperparameter configuration in LLaMA

Method | Reasoning (ACC) | Understanding (ACC) | Generation (BLEU) | Generation (Rouge-L)
zero_shot | 53.70 | 40.00 | 12.56 | 36.70
BitFit | 65.37 | 36.14 | 10.23 | 32.59
RED | 50.26 | 37.86 | 12.77 | 34.19
REPE | 66.04 | 37.43 | 11.49 | 31.04
REFT | 67.12 | 42.29 | 13.05 | 38.25
LOFIT | 57.74 | 32.71 | 13.14 | 35.51
Ours | 70.55 | 47.00 | 17.07 | 40.65

(3) Given the small dataset (e.g., 200 samples in our setting), overfitting was a concern. To reduce overfitting's impact on the baselines, we used early stopping, which was not applied in the original implementations of the baseline systems. We also found that the learning rate schedule significantly affected the results. We evaluated four strategies: a linear schedule [1], a cyclic learning rate schedule [2], an adaptive heuristic schedule [3], and an exponential decay schedule [4]. As shown in the following table, the exponential decay strategy proved most stable, so we used it for both the baselines and our method, as explained in Appendix D.1. The comparison of different learning rate decay strategies for JoLA and LoFIT is as follows.

[1] Human-level control through deep reinforcement learning (Mnih et al., 2015)
[2] Cyclical Learning Rates for Training Neural Networks (Smith et al., 2017)
[3] A disciplined approach to neural network hyper-parameters (Smith et al., 2018)
[4] An exponential learning rate schedule for deep learning (Li et al., 2019)

Different learning rate strategies in JoLA

Strategy | SIQA | WinoGrande | Law | Physics | e2e_nlg | Web_nlg
Linear | 62.71 | 56.49 | 38.00 | 42.00 | 14.05 | 22.83
Cycle | 64.25 | 57.26 | 39.00 | 43.00 | 14.37 | 23.44
Adaptive | 65.47 | 58.60 | 39.00 | 44.00 | 15.02 | 23.86
Exponential | 66.22 | 58.33 | 40.00 | 46.00 | 15.54 | 24.39

Different learning rate strategies in LoFIT

Strategy | SIQA | WinoGrande | Law | Physics | e2e_nlg | Web_nlg
Linear | 54.13 | 53.36 | 35.00 | 6.00 | 13.84 | 16.95
Cycle | 54.32 | 54.25 | 34.00 | 6.00 | 14.37 | 17.83
Adaptive | 55.18 | 55.57 | 36.00 | 7.00 | 15.24 | 17.64
Exponential | 55.94 | 55.33 | 36.00 | 7.00 | 14.77 | 18.12
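For reference, the exponential decay strategy corresponds to PyTorch's built-in ExponentialLR scheduler; the sketch below is illustrative only, and the model, optimizer, and decay factor are assumed values rather than the configuration used in the experiments.

```python
import torch

# Stand-in model and optimizer; the decay factor gamma=0.95 is an assumed value.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(20):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()  # dummy loss for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()  # lr_t = lr_0 * gamma ** t, decayed once per epoch
    # early stopping on a held-out validation set would be applied here
```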

Q2: (same question as Reviewers QWX1 and m3pE) How is the parameter count of the gating mechanism calculated?

A2: We mentioned the parameter counting in Table 3 of Appendix A, using the same calculation method as ReFT and LoFIT, where trainable parameters are divided by the parameters of the base LLM. For example, we calculated the parameters for the SIQA task using the LLaMA-3-8B model. JoLA’s parameter count considered only the interventions. Interestingly, JoLA’s count matches LoFIT’s.

Q3: Some figures are hard to read, e.g., Figure 4 and Figure 8.

A3: Thank you for your feedback. Due to space constraints, we combined multiple images into one, which affected readability. In the next version, we will adjust the layout to improve clarity and readability.

Reviewer Comment

Thank you for the clarification on the hyper-parameters. Regarding the parameter counting, I feel this way of counting might not reveal the full picture.

I have raised my score accordingly.

Author Comment

Thank you for the insightful feedback. We appreciate the opportunity to clarify our parameter counting in JoLA and address your concerns in detail.

1. Parameter Count Clarification
In JoLA, the trainable parameters include:

  • Multiplicative scaling vectors $m^{(l,i)}$ and additive bias vectors $a^{(l,i)}$ for every attention head.
  • Hard Concrete gate parameters $\phi_m^{(l,i)}$ and $\phi_a^{(l,i)}$ for each head.

During training, all of these parameters are updated via gradient descent. However, the $L_0$ regularization encourages most of the gate parameters to drive their corresponding gates toward zero, effectively "pruning" the majority of the heads. At inference time, only the heads with non-zero gate expectations contribute to the model's computation, meaning only their $m^{(l,i)}$ and $a^{(l,i)}$ are applied.

2. Comparison to LOFIT
LOFIT pre-selects a fixed subset of attention heads in a two-step process: (1) Updating parameters for all heads; (2) Fine-tuning only the selected ones.
In contrast, JoLA continuously updates parameters across all heads during training. Thanks to the dynamic gating mechanism, the number of active parameters at inference is comparable to that in LOFIT.
For example, consider the LLaMA-3-8B model. The trainable parameters can be computed as:

$$P_{\text{trainable}} = \frac{D_{\text{attn}} \times (N_{\text{multi}} + N_{\text{add}} + N_{\text{gate}})}{P_{\text{LLMs}}}$$

where:

  • $D_{\text{attn}}$ is the dimension of each attention head,
  • $N_{\text{multi}}$, $N_{\text{add}}$, and $N_{\text{gate}}$ are the numbers of multiplicative, additive, and gating parameters, respectively,
  • $P_{\text{LLMs}}$ is the total number of parameters in the base LLM.

3. Simplified Calculation from Table 3:

  • LOFIT: $\text{Trainable Parameters} = \frac{128 \times (32 \times 32 + 32 \times 32)}{8{,}030{,}257{,}152} + \frac{128 \times (64 + 64)}{8{,}030{,}257{,}152} \approx 0.00003468481$ (or 0.003468481%)
  • JoLA: $\text{Trainable Parameters} = \frac{128 \times (32 \times 32 + 32 \times 32 + 32 \times 32 + 32 \times 32)}{8{,}030{,}257{,}152} \approx 0.00006528906$ (or 0.006528906%), where $32 \times 32$ reflects that each of the 32 layers has 32 attention heads, and 64 denotes the number of selected heads in LOFIT, which can optionally be set to 128. Both ratios are re-derived in the short sketch below.
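As a sanity check, both ratios can be reproduced with a few lines of arithmetic using only the quantities above (head dimension 128, 32 layers × 32 heads, 64 selected heads for LOFIT, and 8,030,257,152 base-model parameters); this simply re-computes the formula above rather than providing an independent count.

```python
# Re-computing the two ratios above (all numbers taken from the calculation shown).
D_ATTN = 128                  # dimension of each attention head
HEADS = 32 * 32               # 32 layers x 32 attention heads
P_LLM = 8_030_257_152         # parameter count of the base LLaMA-3-8B model

# LOFIT: two per-head parameter sets over all heads, plus the (64 + 64) stage-2 term.
lofit = D_ATTN * (HEADS + HEADS) / P_LLM + D_ATTN * (64 + 64) / P_LLM
# JoLA: four per-head parameter sets (multiplicative, additive, and two gates).
jola = D_ATTN * (HEADS + HEADS + HEADS + HEADS) / P_LLM

print(f"LOFIT: {lofit:.11f} ({lofit:.9%})")   # ~0.00003468481 (~0.003468481%)
print(f"JoLA:  {jola:.11f} ({jola:.9%})")     # ~0.00006528906 (~0.006528906%)
```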

JoLA and LOFIT maintain similar levels of activated parameters at inference. The small variations across tasks are expected: JoLA's activated heads vary dynamically by task. LOFIT's activated heads are determined by a fixed, manually set value.

We plan to update Table 3 in the next version to clearly distinguish between trainable and activated parameters as follows. We believe this will provide a clearer and more comprehensive understanding of our parameter counting methodology.

Method | Total Params (%) | Active Params (%)
LOFIT | 0.003468481 | 0.0002
JoLA | 0.006528906 | 0.0002

We hope this explanation addresses your concerns and helps clarify our approach.

Official Review (Rating: 4)

This paper presents a novel extension of the activation editing approach. Its primary contribution lies in integrating localization and editing into a single process using a gating mechanism, unlike previous two-stage methods that first manually locate and then edit model components. This makes the proposed method more practical and adaptive. Extensive experiments on multiple LLMs demonstrate the effectiveness of this approach, particularly in low-resource scenarios, where it outperforms existing methods.

Questions for the Authors

  1. What exactly do the Hidden States and bias term in Fig. 2 refer to?

  2. Could the reason for multiple components performing worse in Fig. 2 be overfitting? Similarly, the authors' gating mechanism demonstrates that selectively editing attention heads is more effective than editing all heads, which is somewhat counterintuitive. Could this also be due to overfitting?

  3. By the end of training, how many attention heads are typically completely shut off?

  4. Since the gating mechanism is modeled as a distribution, how does its randomness manifest during inference?

Claims and Evidence

The claims made in the submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria make sense for the problem and application at hand:

  1. The use of a gating mechanism to jointly perform localization and editing is a reasonable approach.

  2. The experiments are well-controlled by ensuring a fair comparison of trainable parameters.

Theoretical Claims

NA

Experimental Design and Analyses

The soundness and validity of the experimental designs and analyses were checked.

Supplementary Material

NA

Relation to Prior Work

NA

Missing Essential References

I did not find any essential references that were missing from the discussion.

Other Strengths and Weaknesses

NA

Other Comments or Suggestions

I find the description of the gating mechanism not entirely clear. Specifically, the parameterized form of the Hard-Concrete distribution is not explicitly detailed. The trainable parameters of the gating mechanism are not clearly stated. It would be helpful if the authors could clarify these aspects to provide a more precise understanding of how the gating mechanism operates and how it is optimized during training.

Author Response

Thank you for your feedback and suggestions.
Comment: The parameterized form of the hard-concrete distribution is not explicitly detailed. The trainable parameters of the gating mechanism are not clearly stated.
Response: The hard concrete distribution has two associated scalar parameters: a scale parameter and a temperature parameter. Following prior work on sparsification (e.g., Voita et al., 2019; Louizos et al., 2017), we train only the scale parameter and fix the temperature to 0.33. To clarify, these gates do not take any input – each gate is simply an instance of the hard concrete distribution with a single learnable parameter. We will clarify this in the new version of the paper.
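To make this concrete, the following is a minimal sketch of such an input-free Hard Concrete gate in PyTorch, following the standard construction of Louizos et al. (2017): a single learnable location parameter per gate (the "scale" parameter in the terminology above) with the temperature fixed to 0.33. The stretch limits, clamping, and the deterministic rule used at evaluation time are the usual choices from that line of work and should be read as assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """One gate per attention head: no input, a single learnable parameter each."""

    def __init__(self, num_gates: int, temperature: float = 0.33,
                 gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        # Single learnable location parameter per gate; temperature is fixed.
        self.log_alpha = nn.Parameter(torch.zeros(num_gates))
        self.temperature, self.gamma, self.zeta = temperature, gamma, zeta

    def forward(self) -> torch.Tensor:
        if self.training:
            # Reparameterized sample from the Hard Concrete distribution.
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.temperature)
        else:
            # Deterministic gate at evaluation time (no sampling noise).
            s = torch.sigmoid(self.log_alpha)
        stretched = s * (self.zeta - self.gamma) + self.gamma
        return stretched.clamp(0.0, 1.0)  # gate values in [0, 1], exactly 0 or 1 possible

    def expected_l0(self) -> torch.Tensor:
        # Expected-L0 penalty: sum of probabilities that each gate is non-zero.
        return torch.sigmoid(
            self.log_alpha - self.temperature * torch.log(torch.tensor(-self.gamma / self.zeta))
        ).sum()


# Usage: gates are sampled during training and the expected-L0 term is added to the loss.
gates = HardConcreteGate(num_gates=32)
g = gates()                    # per-head gate values, shape (32,)
penalty = gates.expected_l0()  # add lambda * penalty to the task loss
```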

  • Q1: What exactly do the hidden states and bias term in Fig. 2 refer to?
  • A1: Figure 2 illustrates the previously proposed forms of intervening in Transformer modules. The “Hidden states” approach follows REFT, applying interventions directly to the MLP hidden states. The “Bias” approach follows BitFit, modifying only the bias terms of attention, dropout, and layer norm activations. We will elaborate on these in Section 3.1 (currently lines 86–88) to clarify further.
  • Q2: Could the reason for multiple components performing worse in Fig. 2 be overfitting?
  • A2: We think it can be interpreted as a form of overfitting, though we do mitigate this using early stopping. More broadly, the results highlight that both deciding where we intervene (i.e., the choice and location of components) and having fewer interventions are important in our low-resource setting. As shown below, even when provided with more data (500 examples), the MLP+Attention intervention fails to match the performance of Attention-only using 200 examples across most datasets.

The same sample size as in Figure 2 (200)

Setting | SIQA | WinoGrande | Law | Physics | E2E_NLG | WEB_NLG
Attention | 55.94 | 55.33 | 36.00 | 7.00 | 14.77 | 18.12
MLP | 50.10 | 51.62 | 34.00 | 20.00 | 10.31 | 14.45

Sample size 300

Setting | SIQA | WinoGrande | Law | Physics | E2E_NLG | WEB_NLG
Attention | 56.34 | 55.85 | 36.00 | 18.00 | 15.25 | 18.46
MLP | 53.07 | 51.93 | 35.00 | 20.00 | 10.68 | 14.91

Sample size 500

Setting | SIQA | WinoGrande | Law | Physics | E2E_NLG | WEB_NLG
Attention | 56.86 | 56.37 | 37.00 | 20.00 | 15.87 | 18.85
MLP | 53.92 | 52.45 | 36.00 | 21.00 | 11.22 | 15.26

Different sample sizes (Attention+MLP)

Sample size | SIQA | WinoGrande | Law | Physics | E2E_NLG | WEB_NLG
200 | 52.17 | 48.74 | 23.00 | 13.00 | 8.23 | 12.36
300 | 52.82 | 49.51 | 25.00 | 14.00 | 8.85 | 12.74
500 | 53.24 | 50.36 | 28.00 | 16.00 | 9.39 | 13.03
  • Q3: How many attention heads are typically completely shut off by the end of training?
  • A3: To clarify, we are not shutting down any attention heads. Instead, we selectively choose which heads to apply interventions to. When both the offset gate $g_a$ and the multiplicative update gate $g_m$ are set to $0$, the head's computation becomes identical to that of the original model. By the end of training, most gates are closed; for example, on OBQA, 86% of attention heads have $g_a = 0$, and 94% have $g_m = 0$ (see Figure 8).
  • Q4: How does the gating mechanism's randomness manifest during inference?
  • A4: During training, we model each gating variable (e.g., $g_a^{(l,i)}$ and $g_m^{(l,i)}$) as random, sampled from a Hard Concrete distribution. However, during inference, we use their expected values, $\mathbb{E}[g_a^{(l,i)}]$ and $\mathbb{E}[g_m^{(l,i)}]$, instead of sampling from the distributions. This removes randomness, ensuring consistency and stability at inference. We briefly mention this in lines 165-166 and will clarify it in the next version.
Reviewer Comment

I appreciate the authors' response and the additional experiments, which have addressed all of my concerns. Therefore, I have decided to raise my score. I hope the authors will include these clarifications from the rebuttal in the next version of the paper, as I believe they are important for understanding your work.

Author Comment

Thank you for your valuable feedback and for raising the score. We are glad the additional experiments addressed your concerns. We will include the clarifications from the rebuttal in the next version of the paper for better clarity.

Final Decision

This paper presents JoLA (Joint Localization and Activation Editing), a novel and practical method for parameter-efficient fine-tuning (PEFT) of large language models (LLMs) in low-resource settings. JoLA introduces a unified framework that simultaneously (1) identifies which attention heads to intervene on, (2) learns the type of intervention (additive, multiplicative, or both), and (3) optimizes the associated intervention parameters. A core innovation lies in its use of Hard Concrete gating with expected-L0 regularization, allowing selective and sparse updates.

JoLA is a well-motivated, rigorously evaluated, and practically impactful method for PEFT in LLMs. The approach is both novel in design and strong in empirical performance: JoLA consistently outperforms baselines across 26 tasks and multiple model families, including LLaMA, Qwen, and Mistral, under low-resource settings (e.g., ~200 samples), with substantial improvements over state-of-the-art baselines. I recommend acceptance.