PaperHub
Rating: 7.5/10
Oral · 4 reviewers
Min 6 · Max 8 · Std 0.9
Ratings: 8, 8, 6, 8
Confidence: 2.5
Correctness: 3.3
Contribution: 3.3
Presentation: 3.0
ICLR 2025

When is Task Vector Provably Effective for Model Editing? A Generalization Analysis of Nonlinear Transformers

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-05-18
TL;DR

We provide the first theoretical characterization of the generalization guarantees of task vector methods on nonlinear Transformers.

Abstract

Keywords
Task arithmetic, generalization, nonlinear Transformers, deep learning theory, machine unlearning

Reviews and Discussion

Review
Rating: 8

The paper produces the first theoretical result on the ability of task vectors to generalize to new tasks and to unlearn a task. The study is conducted through the lens of "aligned" vs. "irrelevant" vs. "contradictory" tasks. The authors study the necessary fine-tuning hyperparameters and the task vector coefficients ($\alpha$) that enable generalization and unlearning.

Additionally, the authors verify the theoretical results with experiments on a toy CMNIST dataset and on a next-token prediction task.

Strengths

  • Very exciting theoretical result that includes very practical aspects (e.g., hyperparameters of the fine-tuning, how to set $\alpha$ for each task vector, etc.)
  • Novel characterization of generalization conditions on a 1-layer transformer, building on previous work that uses NTK assumptions
  • Practical scenarios in terms of the relation between tasks (aligned vs. irrelevant vs. contradictory)
  • Nicely written and relatively accessible to non-theory people. I particularly like the remark after each theorem that explains what the theory says and what it implies
  • Nice setup on CMNIST that reflects the aligned vs. irrelevant vs. contradictory condition, followed by a nice non-toy experiment.

Weaknesses

N/A -- good paper overall :)

Questions

Definition 2 -- what is the dimension of $\mu_{\mathcal{T}}$? Is it the same as $v$?

Comment

We appreciate your valuable time for the evaluation.

Q1 (Questions): What is the dimension of $\mu_{\mathcal{T}}$? Is it the same as $v$?

A1: The dimension of $\mu_{\mathcal{T}}$ is $d$. Yes, it is the same dimension as $v$. Thank you for pointing this out. We have added $\mu_{\mathcal{T}}\in\mathbb{R}^d$ at the beginning of Definition 2.

Review
Rating: 8

This paper provides a theoretical analysis of the effectiveness of task vector methods for model editing in transformers. The authors investigate the conditions under which task addition (for multi-task learning) and task negation (for unlearning) are effective, proving that task correlation plays a crucial role. They also establish conditions for successful out-of-domain generalization using task vectors. Experiments with both synthetic and real-world data validate the key concepts of the proposed theory.

Strengths

  • This paper is the first to theoretically examine the generalization of task vectors, filling a significant gap in current research.
  • The writing is clear and easy to follow.
  • The theoretical contributions are well-supported by experiments, effectively bridging theory and empirical validation.
  • The theoretical insights align well with intuitive expectations regarding the effects of task vectors across aligned, irrelevant, and contradictory tasks.

Weaknesses

  • Although insightful, the data model may be overly simplistic for capturing the complexities of real-world data. For example, even for simple yes/no questions, a negation word in a sentence may flip the relevance of certain words in the sentence, which cannot be captured by the proposed data model. I wonder if some theoretical aspects can be generalized independently of this data model.
  • The analysis is restricted to a one-layer transformer with limited nonlinearity, despite claims in the title and introduction regarding the challenges of analyzing nonlinearity in task vectors.

Questions

I found this paper highly engaging and believe it would attract even greater interest with an exploration of the theory across a broader set of tasks, particularly generative tasks. For example, could the proposed theoretical framework be extended to multiclass classification as an initial step? A discussion on how these insights might be applied to a wider range of tasks would substantially enhance the paper's appeal. I would be willing to increase my score if the authors could provide even preliminary ideas on these extensions.

Comment

We thank you for your valuable time and efforts in the evaluation. We have added Appendix D to discuss the extension to the multi-classification case.

Q1 (Weakness 1): Although insightful, the data model may be overly simplistic for capturing the complexities of real-world data. A negation word in a sentence may flip the relevance of certain words in the sentence, which cannot be captured by the proposed data model.

A1: This is a good question. First, our data formulation follows state-of-the-art theoretical studies of model training on vision data [Deng et al., 2023; Bu et al., 2024; Jiang et al., 2024] published at top ML conferences (ICML, ICLR, and Neurips). For example, Section 2.2 of [Deng et al., 2023], Section 2.1 of [Bu et al., 2024], and Section 3 of [Jiang et al., 2024] each consider opposite key patterns for opposite labels. The data formulation we use is consistent with these existing theoretical works. The simplified data formulation allows us to characterize the gradient updates in different directions for different patterns in the convergence and generalization analysis. Meanwhile, to the best of our knowledge, the mathematical formulations of the state-of-the-art training and generalization analyses of Transformers learning language data [Li et al., 2023b; Chen et al., 2024] do not explicitly consider the impact of negative words either. These works establish the data model as a Markov chain, an N-gram model, etc., and also consider input language tokens as orthonormal vectors, as in our work.

Second, our formulation considers word embeddings as the input instead of the raw data. In the embedding space, a negative word may map the contexts to the negative direction of the same contexts without the negative word. Based on such embeddings, our analysis can be applied.

Deng et al., 2023. Robust Learning with Progressive Data Expansion Against Spurious Correlation. At Neurips.

Bu et al., 2024. Provably Neural Active Learning Succeeds via Prioritizing Perplexing Samples. At ICML.

Jiang et al., 2024. Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization. At Neurips.

Chen et al., 2024. Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers. Preprint.

Li et al., 2023b. How do transformers learn topic structure: Towards a mechanistic understanding. At ICML.

Comment

Q2 (Weakness 2): The analysis is restricted to a one-layer transformer with limited nonlinearity, despite claims in the title and introduction regarding the challenges of analyzing nonlinearity in task vectors.

A2: Thank you for the question. First, the motivation of our analysis emphasizes the contribution of analyzing nonlinear Transformers, in contrast to the existing approach that analyzes task vectors in the tangent space, which requires linearization of the Transformer models [Ortiz-Jimenez et al., 2023; Wu et al., 2023]. Note that [Ortiz-Jimenez et al., 2023] use the Neural Tangent Kernel (NTK) framework (Jacot et al., 2018) to study neural networks as linearized models under specific assumptions. Under this linearized setting, the use of linear arithmetic on task vectors for targeted model editing can be justified. In contrast, our work is the first to characterize the learning dynamics of Transformers outside the NTK regime. Although our model is only a one-layer Transformer, our analysis of the learning dynamics does not require linearization as in the existing work; thus, we emphasize that we investigate task arithmetic using nonlinear Transformers.

Second, regarding the simplification of the model, the state-of-the-art theoretical works [Li et al., 2023; Jiang et al., 2024; Makkuva et al., 2024] recently published at top ML conferences like ICLR and ICML on the theoretical generalization and/or learning dynamics of Transformers also focus on one-layer single-head Transformers only. That is because the loss landscape for multi-layer/head Transformers is highly nonlinear and non-convex due to the interactions between multiple nonlinear functions. The extension to multi-layer multi-head Transformers requires a more complicated characterization of the gradient updates, while the simplified model architecture can benefit the theoretical analysis as mentioned by the reviewer. For the theoretical generalization guarantees of Transformers beyond one-layer single-head, to the best of our knowledge, only [Tian et al., 2023; Chen et al., 2024a; Chen et al., 2024b] theoretically study learning with multi-head or multi-layer Transformers. To the best of our knowledge, no existing works theoretically study the generalization of merging multiple task vectors, even with one-layer single-head Transformers. Therefore, we focus on the one-layer analysis to obtain some initial theoretical insights. The theoretical findings are justified with multi-layer multi-head real-world Transformer architectures ViT-Small/16 and Phi-1.5, respectively. The results are consistent with the findings in Theorems 1, 2, and 3 using one-layer single-head Transformers.

Ortiz-Jimenez et al., 2023. Task arithmetic in the tangent space: Improved editing of pre-trained models. At Neurips.

Wu et al., 2023. On the Convergence of Encoder-only Shallow Transformers. At Neurips.

Li et al., 2023. A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity. At ICLR.

Jiang et al., 2024. Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization. At Neurips.

Tian et al., 2023. Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer. At Neurips.

Makkuva et al., 2024. Local to Global: Learning Dynamics and Effect of Initialization for Transformers. At Neurips.

Chen et al., 2024a. Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality. Preprint.

Chen et al., 2024b. Provably learning a multi-head attention layer. Preprint.

Comment

Q3 (Question 1): This paper would attract greater interest with an exploration of the theory across a broader set of tasks, particularly generative tasks. Could the proposed theoretical framework be extended to multiclass classification as an initial step?

A3: This is a great question. We understand that the reviewer expects a more general setup of multi-classification. The reason why we only discuss binary classification in the main body of our paper is that all the existing theoretical works studying the optimization and generalization of Transformers [Jelassi et al., 2022; Li et al., 2023; Jiang et al., 2024] only consider binary classification.

Our theoretical analysis for binary classification can be extended to multi-class classification. We have added a related discussion in Appendix D. Briefly speaking, one can define a $2^c$-classification that is achieved by $c$ binary classifications with the orthonormal set $\{\mu_{\mathcal{T}}^{(1)}, \cdots, \mu_{\mathcal{T}}^{(c)}\}$ as the discriminative patterns for the task $\mathcal{T}$. The label is $c$-dimensional, with each entry chosen from $\{+1,-1\}$. We then need $c$ copies of our binary model (Equation 4) to generate the output for the $2^c$-classification. The learning process then consists of $c$ independent binary classification problems, one for each entry of the $c$-dimensional output. After fine-tuning, the trained model of each output entry has a property similar to Lemma 1 for binary classification. Then, for the discriminative patterns $\{\mu_{\mathcal{T}_1}^{(i)}\}_{i=1}^c$ of task $\mathcal{T}_1$ and $\{\mu_{\mathcal{T}_2}^{(i)}\}_{i=1}^c$ of task $\mathcal{T}_2$: if for any $\mu_{\mathcal{T}_1}^{(i)}$ there exists a unique $\mu_{\mathcal{T}_2}^{(i)}$ close to orthogonal to $\mu_{\mathcal{T}_1}^{(i)}$, then $\mathcal{T}_1$ and $\mathcal{T}_2$ are irrelevant; if for any $\mu_{\mathcal{T}_1}^{(i)}$ there exists a unique $\mu_{\mathcal{T}_2}^{(i)}$ with a small angle to (or almost opposite to) $\mu_{\mathcal{T}_1}^{(i)}$, then $\mathcal{T}_1$ and $\mathcal{T}_2$ are aligned (or contradictory). We can then derive conclusions similar to our Theorems 1 and 2 by combining the results over all output entries.
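
To make the decomposition concrete, here is a minimal sketch assuming a simple bit-based encoding; the helper names and the particular encoding are our illustration, not the authors' construction. Each of the $2^c$ classes is mapped to a $c$-dimensional $\pm 1$ label, and $c$ binary predictors are combined by taking the sign of each output entry.

```python
import numpy as np

def class_to_pm1_labels(class_idx, c):
    """Encode a class index in {0, ..., 2^c - 1} as a c-dimensional +/-1 label,
    one entry per binary sub-problem (illustrative encoding, not the paper's)."""
    return np.array([1 if (class_idx >> i) & 1 else -1 for i in range(c)])

def pm1_labels_to_class(signs):
    """Decode c signs (e.g., the signs of the c binary model outputs) back to a class index."""
    return sum(1 << i for i, s in enumerate(signs) if s > 0)

# Each binary predictor f_i plays the role of the model in Equation 4 for its own
# discriminative pattern mu_T^{(i)}; a multi-class prediction combines their signs:
# predicted_class = pm1_labels_to_class(np.sign([f_i(X) for f_i in binary_models]))
```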

Jelassi et al., 2022. Vision transformers provably learn spatial structure. At Neurips.

Li et al., 2023. A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity. At ICLR.

Jiang et al., 2024. Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization. At Neurips.

Comment

Thank you to the authors for your detailed response and for incorporating our feedback into the revised paper. Regarding the second point raised by A1, I am curious if there are any studies or papers that demonstrate ''In the embedding space, a negative word may map the contexts to the negative direction of the same contexts without the negative word'' empirically?

Comment

A: Thank you for the question. By this statement, we mean that the embeddings of a positive word and a corresponding negative word are distinct enough. There are several empirical works [Engler et al., 2022; Liu et al., 2024] studying task vectors in the activation/embedding space or the interpretability of word embeddings that can verify this statement to some extent. Section 4.2 of [Engler et al., 2022] concludes that in the binary sentiment prediction task, the word embeddings of antonymous words, such as "unpleasant" and "pleasant", or "tasteless" and "tasteful", are the most discriminative. Note that the embeddings can also refer to the outputs of intermediate layers of a deep network. Table 5 of [Liu et al., 2024] illustrates that by adding the negation of a positive embedding to each layer of a deep Transformer, the model output changes from a positive generation to a negative one. This means that two embeddings, where one is in the negative direction of the other, are decoded as antonymous generations in the setting of [Liu et al., 2024], which justifies our statement.

Engler et al., 2022. SensePOLAR: Word sense aware interpretability for pre-trained contextual word embeddings. At EMNLP findings.

Liu et al., 2024. In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering. At ICML

Comment

Thank you for finding the relevant papers. I would recommend including a discussion of them to justify Definition 2. I also have an additional question about A3: How do the extended bounds in Theorems 1 and 2 depend on the number of classes, $2^c$? Do they increase on a log scale? Specifically, are they proportional to $c$ or $2^c$ when the number of classes is $2^c$?

Comment

Dear Reviewer L24R,

We have added a footnote to introduce the empirical motivation of the studied data formulation for binary classification around Definition 2 on page 5 in the updated version. Thank you for the suggestion.

It is a great question how our derived bounds in Theorems 1 and 2 change when the number of classes becomes $2^c$ for $c>1$. We think that to achieve the same generalization performance as in Theorems 1 and 2, the required numbers of iterations and samples increase on a log or poly-log scale in the number of classes. Consider our extension to $2^c$-classification in Appendix E as an example. This $2^c$-classification essentially solves $c$ parallel binary classification problems. However, $\delta_*^{(i)}$, the fraction of the label-relevant pattern $\mu_{\mathcal{T}}^{(i)}$, $i\in[c]$, may decrease by a factor of $c$ on average compared with the binary classification scenario. Therefore, by condition (iii) of Theorem 1, the numbers of iterations and samples increase by a factor of $c^2$, which is polynomial in the log of the number of classes $2^c$. Note that this derivation comes from our problem formulation of $2^c$-classification and our analytical tools for the optimization and generalization of Transformers. Additional assumptions and advanced techniques might lead to a tighter bound.

Authors

Comment

Thank you for the detailed answer! The rebuttal effectively addressed my concerns and questions, and I enjoyed the discussion. I have raised my score accordingly.

Comment

Dear Reviewer L24R,

We are delighted that our response effectively addresses your concerns. Thank you for the discussion and raising the score from 6 to 8.

Authors

Review
Rating: 6

This work focuses on the task vector in the context of task arithmetic, demonstrating its role in learning and unlearning tasks through experiments. It's an interesting topic that needs more exploration. The authors find that task correlation affects the performance of the task vector and provide theoretical justification for the low-rank approximation of the task vector and its ability to adapt to out-of-domain tasks.

Strengths

  • Very comprehensive mathematical analysis and theoretical proofs
  • Discussion on the task vector is extensive.

Weaknesses

  • There are some issues with the paper's writing (Formula 4 and Definition 2 in Section 3.2 are confusing).
  • In the language generation task, only a model with 1.5B parameters is used, and the experimental results do not fully meet expectations (there is also a noticeable performance loss on the so-called irrelevant task).

Questions

  • Line 236-238, the conventional attention expression is $\text{softmax}(W_Q X X^\top W_K^\top) W_V X$; why is it written as $W_V X\,\text{softmax}(X^\top W_K^\top W_Q X)$ in Formula 4?
  • Line 236-238, what is the meaning of $X^n$?
  • Line 242, why is $x_i$ used here, while $X$ is used in Formula 4?
  • Line 261, since $\mu_{\mathcal{T}}$ and $v_j$ are orthogonal, what is the meaning of tokens corresponding to $\mu_{\mathcal{T}}$?
  • How to quantify the relevance of different language generation tasks? Are semantically similar and task-related equivalent?

Comment

Q5 (Question 5): How to quantify the relevance of different language generation tasks? Are semantically similar and task-related equivalent?

A5: This is a good question. One possible practical method to approximate the relevance $\alpha$ between two different language generation tasks is as follows; it is illustrated around Equation 11 of the updated submission. Consider two models $\Psi^*_{\mathcal{T}_1}$ and $\Psi^*_{\mathcal{T}_2}$ finetuned on task $\mathcal{T}_1$ with test set $\mathcal{D}_1$ and task $\mathcal{T}_2$ with test set $\mathcal{D}_2$, respectively. Then, compute $\hat{\alpha}(\Psi^*_{\mathcal{T}_1}, \Psi^*_{\mathcal{T}_2})=\frac{1}{2}\big(\hat{\alpha}(\Psi^*_{\mathcal{T}_1}, \Psi^*_{\mathcal{T}_2}, \mathcal{D}_1)+\hat{\alpha}(\Psi^*_{\mathcal{T}_1}, \Psi^*_{\mathcal{T}_2}, \mathcal{D}_2)\big)$, where $\hat{\alpha}(\Psi^*_{\mathcal{T}_1}, \Psi^*_{\mathcal{T}_2}, \mathcal{D}_j)=\frac{1}{|\mathcal{D}_j|}\sum_{i\in \mathcal{D}_j}\cos\langle \tilde{y}_{1,j}^i, \tilde{y}_{2,j}^i\rangle$ for $j=1,2$ is the average cosine similarity between the centered outputs of the two finetuned models $\Psi^*_{\mathcal{T}_1}$ and $\Psi^*_{\mathcal{T}_2}$ on $\mathcal{D}_j$. Here $\tilde{y}_{l,j}^i=\hat{y}_{l,j}^i-\frac{1}{|\mathcal{D}_j|}\sum_{i\in \mathcal{D}_j}\hat{y}_{l,j}^i$ for $l,j\in\{1,2\}$, where $\hat{y}_{l,j}^i$ is the $i$-th output of the fine-tuned model $\Psi^*_{\mathcal{T}_l}$ on the test set $\mathcal{D}_j$. Note that to compute $\hat{\alpha}(\Psi^*_{\mathcal{T}_1}, \Psi^*_{\mathcal{T}_2})$, we do not require the availability of extra models or datasets except $\Psi^*_{\mathcal{T}_1}$, $\Psi^*_{\mathcal{T}_2}$, and the test sets $\mathcal{D}_1$ and $\mathcal{D}_2$. By computation, $\hat{\alpha}(\Psi^*_{\mathcal{T}_{HP1}}, \Psi^*_{\mathcal{T}_{HP2}})=0.498$ and $\hat{\alpha}(\Psi^*_{\mathcal{T}_{HP1}}, \Psi^*_{\mathcal{T}_{PP}})=0.239$, which indicates that HP1 is more aligned with HP2 than with PP.

Since semantic similarity does not have a formal mathematical formulation, we cannot say it is equivalent to task relevance, which we describe in this paper through the mathematical quantity $\alpha$. However, we would like to illustrate that the two notions are highly correlated in language generation tasks. This is because learning the next-token prediction task is essentially learning the semantic structure from the distribution of the language data. We say HP1 and HP2 are semantically similar since they are from the same author and the same series, which implies a similar writing style and vocabulary. Then, given two similar inputs, the outputs are more likely to be close to each other. Note that this is consistent with the similarity between the two tasks by our definition in Section 3.3. In our theoretical setup, related or aligned tasks have similar label-relevant patterns that correspond to the same label. In our theoretical formulation, we define that $\mathcal{T}_1$ and $\mathcal{T}_2$ are aligned if their correlation $\alpha=\mu_{\mathcal{T}_1}^\top\mu_{\mathcal{T}_2}>0$, where $\mu_{\mathcal{T}_1}$ and $\mu_{\mathcal{T}_2}$ are the discriminative patterns of $\mathcal{T}_1$ and $\mathcal{T}_2$ defined in Definition 2, respectively. Therefore, semantic similarity is a close notion to task relevance/alignment in our language generation task.

Comment

I have read the author rebuttal and made any necessary changes to my review.

Comment

Dear Reviewer 3aPS,

Thank you for raising the score. We are delighted that our rebuttal helps to address your concerns.

Authors

Comment

Q2 (Weakness 2): In the language generation task, only a model with 1.5B parameters is used, and the experimental results do not fully meet expectations (there is also a noticeable performance loss on the so-called irrelevant task).

A2: This is a good question. First, we would like to clarify that the task vector approach and even state-of-the-art machine unlearning methods [Zhang et al., 2024; Jia et al., 2024; Maini et al., 2024] are empirically shown to cause a performance drop on the retain set, the set on which we aim to maintain performance. This is a trade-off between forgetting and retaining performance. For example, Table 3 of [Shi et al., 2024] illustrates that commonly used machine unlearning techniques lead to a loss in utility. In particular, the task vector method decreases performance on the whole BOOKS dataset by 13.1% on average. Therefore, the performance loss on the less aligned task "Pride and Prejudice" in our paper, which is 15.13% for the full-rank task vector in Table 3 and 13.83% for the low-rank task vector in Table 4, is not bad. This is an inherent and reasonable performance degradation of the task vector approach.

Second, we would like to emphasize that the key conclusion of the language generation task is the comparison between the forgetting performance of the aligned task $\mathcal{T}_{HP2}$ and the less-aligned task $\mathcal{T}_{PP}$. Note that the performance decrease of $\mathcal{T}_{PP}$ is much smaller than that of $\mathcal{T}_{HP2}$, while the performance decrease of $\mathcal{T}_{HP2}$ is close to that of $\mathcal{T}_{HP1}$. This verifies Theorem 2: unlearning a task $\mathcal{T}_{HP1}$ can significantly degrade the performance of the aligned task ($\mathcal{T}_{HP2}$), while the accuracy of the less-aligned task ($\mathcal{T}_{PP}$) decreases less. We have revised the sentence in lines 507-509 accordingly to help the understanding.

We thank the reviewer for the suggestion of trying larger models. We use Phi-3-small (7B) to repeat the experiment using LoRA. The result is shown in the table below. We can see that the insight of Theorem 2 still holds, i.e., unlearning a certain task (HP1) effectively forgets the aligned task (HP2), with a performance loss of 52.29%, but forgets the less-aligned task (PP) less, with a performance loss of 20.65%. Moreover, by using a larger model, the unlearning performance on the aligned task HP2 is improved from a 37.23% decrease to a 55.61% decrease. In comparison, the performance difference on the less-aligned PP is much smaller, from a 15.13% decrease to a 20.65% decrease.

| $\lambda$ | 0 (baseline) | -0.2 | -0.4 | -0.6 | -0.8 | -1 |
|---|---|---|---|---|---|---|
| $\mathcal{T}_{HP1}$ | 0.2573 | 0.1989 | 0.1933 | 0.1888 | 0.1542 | 0.1142 (55.61%) |
| $\mathcal{T}_{HP2}$ | 0.2688 | 0.2113 | 0.1993 | 0.1938 | 0.1622 | 0.1563 (52.29%) |
| $\mathcal{T}_{PP}$ | 0.1942 | 0.1825 | 0.1644 | 0.1687 | 0.1592 | 0.1541 (20.54%) |

Shi et al., 2024. MUSE: Machine Unlearning Six-Way Evaluation for Language Models.

Zhang et al., 2024. From Catastrophic Collapse to Effective Unlearning. COLM 2024.

Jia et al., 2024. WAGLE: Strategic Weight Attribution for Effective and Modular Unlearning in Large Language Models. Neurips 2024.

Maini et al., 2024. Tofu: A task of fictitious unlearning for llms. Preprint.


Q3 (Question 3): Line 242, why is $x_i$ used here, while $X$ is used in Formula 4?

A3: Line 242 computes the attention weight between $x_l$ and $x_i$, which is a scalar. The original Formula 4 computes the vector version of the attention, i.e., the vector of attention weights between $x_l$ and every $x_j$ for $j\in[P]$. We have changed Formula 4 to the scalar version to be consistent with that in lines 242-243 and help the understanding. Thank you for pointing it out.


Q4 (Question 4): Line 261, since $\mu_{\mathcal{T}}$ and $v_j$ are orthogonal, what is the meaning of tokens corresponding to $\mu_{\mathcal{T}}$?

A4: We define that "each token is randomly chosen from $\{\mu_{\mathcal{T}}, -\mu_{\mathcal{T}}\}\cup\{v_1,\cdots,v_M\}$". Then, "tokens corresponding to $\mu_{\mathcal{T}}$" refers to "tokens equal to $\mu_{\mathcal{T}}$". We have made revisions accordingly to avoid the confusion.

Comment

We thank the reviewer for the valuable time and effort in the evaluation. We have made revisions in Equation 4, Equation 11, and Definition 2 and included Table 6 of key notations in Appendix B according to the review. We also improved our discussion of our language model experiment in lines 507-509 and added an illustration of the result of less-aligned task in lines 538-539.

Q1 (Weakness 1 & Questions 1 & 2): There are some issues with the paper's writing (Formula 4 and Definition 2 in Section 3.2 are confusing). Line 236-238, the conventional attention expression is $\text{softmax}(W_Q XX^\top W_K^\top)W_V X$; why is it written as $W_V X(X^\top W_K^\top W_Q X)$ in Formula 4? Line 236-238, what is the meaning of $X^n$?

A1: Thank you for the question. We have added Table 6 at the beginning of Appendix B to summarize important notations and improve readability.

Equation 4 should be revised as
$$f(X; \Psi)=\frac{1}{P}\sum_{l=1}^P a_{(l)}^\top\,\text{Relu}\Big(W_O W_V\sum_{s=1}^P x_s\,\text{softmax}_l\big(x_s^\top W_K^\top W_Q x_l\big)\Big).$$
In the original version, there is an $X^n$ by mistake, which should be $X$. We also change the vector version of attention in the original Equation 4 into a scalar version to facilitate understanding. We are sorry for the typo and the confusion.

The formulation of our Transformer model is also used by [Li et al., 2023; Zhang et al., 2024; Huang et al., 2024]. We think the reviewer refers to the formulation $\text{softmax}(X W_Q W_K^\top X^\top) X W_V$ in Question 1, in which the softmax attention is computed over the tokens. This formulation is essentially the same as our original correct version $W_V X\,\text{softmax}_l(X^\top W_K^\top W_Q x_l)$ by transposing $X$, $W_Q$, $W_K$, and $W_V$.
For Question 3, $X^n$ should be first introduced in line 245; it denotes the $n$-th input data $X\in\mathbb{R}^{d\times P}$.
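
As a reading aid, here is a minimal numpy sketch of the corrected Equation 4, under our own assumptions about weight shapes (helper names and dimensions are illustrative, not the authors' code): for each query token $x_l$, a softmax over all tokens $x_s$ produces attention weights, the attended context passes through $W_V$, $W_O$, and a ReLU, and the results are averaged via the output heads $a_{(l)}$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def one_layer_single_head(X, W_Q, W_K, W_V, W_O, a):
    """Sketch of the one-layer, single-head model in the corrected Equation 4.
    X: (d, P) with tokens x_1, ..., x_P as columns; a: (P, m) with rows a_(l).
    Shapes are illustrative assumptions, not taken from the paper."""
    d, P = X.shape
    out = 0.0
    for l in range(P):
        # attention scores x_s^T W_K^T W_Q x_l for all tokens s, softmax over s
        scores = np.array([X[:, s] @ W_K.T @ W_Q @ X[:, l] for s in range(P)])
        attn = softmax(scores)
        context = X @ attn                               # sum_s x_s * softmax_l(...)
        hidden = np.maximum(W_O @ (W_V @ context), 0.0)  # Relu(W_O W_V (...))
        out += a[l] @ hidden                             # a_(l)^T Relu(...)
    return out / P
```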

Li et al., 2023. A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity. At ICLR.

Zhang et al., 2024. Trained Transformers Learn Linear Models In-Context. At JMLR.

Huang et al., 2024. In-context convergence of transformers. At ICML.

Review
Rating: 8

This paper explores the theoretical aspects of task vector arithmetic as a model editing technique for multi-task learning, unlearning, and out-of-domain generalization. The authors provide a theoretical analysis to justify why and when task vector methods are effective in nonlinear Transformer models, especially for binary classification tasks. They prove that task addition facilitates multi-task learning for aligned or irrelevant tasks, while task negation can effectively unlearn contradictory or irrelevant tasks. Additionally, they offer generalization guarantees for out-of-domain tasks and theoretical justification for task vector approximations. These findings are empirically validated through various experiments.

Strengths

  • The paper is very well-written and easy to follow.
  • It provides a guideline for when and why task arithmetic works in multi-task learning, machine unlearning, and generalization to new tasks.
  • The discussion of low-rank approximations and magnitude-based pruning of task vectors supports the use of efficient approximation techniques in task arithmetic fine-tuning.
  • This is the first known theoretical generalization analysis of task vector arithmetic in nonlinear Transformer-based models, filling a notable gap in the literature.
  • The theoretical claims are validated through empirical experiments on the Phi-1.5 language model and Colored-MNIST image classification, adding practical credibility to the proposed framework.

Weaknesses

  • The theoretical analysis relies on a single-head, one-layer Transformer model, which may limit the applicability of the results to more complex multi-layer Transformer architectures.
  • While the empirical validation includes a large language model and a basic image classification task, the study could benefit from a broader set of tasks, including more complex or structured tasks beyond binary classification.
  • Although the theoretical framework outlines conditions for selecting arithmetic coefficients, more practical guidelines or analyses for tuning these coefficients in real-world applications would be beneficial.

typos:

  • line 288, "fine-turning" -> "fine-tuning"
  • line 388, "are are" -> "are"

Questions

I don't have questions.

Comment

We thank the reviewer for the valuable time in the evaluation. We have included a more practical guideline for approximating correlations between tasks and tuning task vector hyperparameters by revising Equation 11 in the updated submission.

Q1 (Weakness 1): The theoretical analysis relies on a single-head, one-layer Transformer model, which may limit the applicability of the results to more complex multi-layer Transformer architectures.

A1: This is a good question. First, the state-of-the-art theoretical works [Li et al., 2023; Makkuva et al., 2024] recently published at top ML conferences like ICLR, Neurips, and ICML on the theoretical generalization and/or learning dynamics of Transformers also focus on one-layer single-head Transformers only. That is because the loss landscape for multi-layer/head Transformers is highly nonlinear and non-convex due to the interactions between multiple nonlinear functions. The extension to multi-layer multi-head Transformers requires a more complicated characterization of the gradient updates, while the simplified model architecture can benefit the theoretical analysis. For the theoretical generalization guarantees of Transformers beyond one-layer single-head, to the best of our knowledge, only [Tian et al., 2023; Chen et al., 2024a; Chen et al., 2024b] theoretically study learning with multi-head or multi-layer Transformers. To the best of our knowledge, no existing works theoretically study the generalization of merging multiple task vectors, even with one-layer single-head Transformers. Therefore, we focus on the one-layer analysis to obtain some initial theoretical insights.

Second, the simplification of one-layer single-head Transformers enables us to make contributions under our theoretical settings. Our work is the first one to prove the effectiveness of task addition and negation with Transformer models using feature-learning analysis. Our work also proves guaranteed generalization of task arithmetic in out-of-domain generalization. The simplification also enables the theoretical proof of the validity of low-rank approximation and magnitude-based pruning for task vectors.

Third, the theoretical conclusions for one-layer single-head Transformers are empirically verified with multi-layer multi-head Transformers to some extent. We justify our theoretical findings on the multi-layer multi-head real-world Transformer architectures ViT-Small/16 and Phi-1.5, respectively. The experiments in Figures 1 and 2 and Tables 3 and 4 justify our Theorems 1, 2, and 3. This implies that our generalization results for task arithmetic on different tasks can potentially be extended to multi-layer multi-head cases. We leave the detailed theoretical analysis of this extension as future work.

Li et al., 2023. A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity. At ICLR.

Jiang et al., 2024. Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization. At Neurips.

Tian et al., 2023. Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer. At Neurips.

Makkuva et al., 2024. Local to Global: Learning Dynamics and Effect of Initialization for Transformers. At Neurips.

Chen et al., 2024a. Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality. Preprint.

Chen et al., 2024b. Provably learning a multi-head attention layer. Preprint.

Comment

Q2 (Weakness 2): While the empirical validation includes a large language model and a basic image classification task, the study could benefit from a broader set of tasks, including more complex or structured tasks beyond binary classification.

A2: Thank you very much for the insightful question. First, we would like to emphasize that the proposed LLM unlearning experiment is a generative task, which goes beyond binary classification.

Second, we sincerely appreciate the encouragement to explore more complex or structured tasks. Since this work primarily focuses on theoretical contributions and insights, we prioritized experiments on the binary classification task using the Colored-MNIST dataset. This simpler setup was chosen because it effectively allows us to validate and illustrate the theoretical insights derived in our study. For example, we prove that task addition is effective for multi-task learning when the tasks are either irrelevant or aligned (Theorem 1), while task negation is provably successful for unlearning tasks that are either irrelevant or contradictory (Theorem 2). A linear combination of these task vectors can generalize to a wide range of new tasks by properly selecting the arithmetic coefficients (Theorem 3). To further substantiate our theoretical results, we extended our experiments to include machine unlearning tasks on LLMs. We believe that the selected tasks and datasets are representative, ensuring that our theory is both well-supported and practically applicable. Thank you again for your valuable feedback, which will guide future extensions of this work.


Q3 (Weakness 3): Although the theoretical framework outlines conditions for selecting arithmetic coefficients, more practical guidelines or analyses for tuning these coefficients in real-world applications would be beneficial.

A3: This is a great question. Our Theorems 1 and 2 show that, given the correlation $\alpha$ between tasks $\mathcal{T}_1$ and $\mathcal{T}_2$, we can determine whether the merged model $\Psi=\Psi^{(0)}+\Delta\Psi_{\mathcal{T}_1}+\lambda \Delta\Psi_{\mathcal{T}_2}$ can take effect and what $\lambda$ to choose.

One possible practical method to approximate $\alpha$ is as follows. Consider two models $\Psi^*_{\mathcal{T}_1}$ and $\Psi^*_{\mathcal{T}_2}$ finetuned on task $\mathcal{T}_1$ with test set $\mathcal{D}_1$ and task $\mathcal{T}_2$ with test set $\mathcal{D}_2$, respectively. Then, compute $\hat{\alpha}(\Psi^*_{\mathcal{T}_1}, \Psi^*_{\mathcal{T}_2})=\frac{1}{2}\big(\hat{\alpha}(\Psi^*_{\mathcal{T}_1}, \Psi^*_{\mathcal{T}_2}, \mathcal{D}_1)+\hat{\alpha}(\Psi^*_{\mathcal{T}_1}, \Psi^*_{\mathcal{T}_2}, \mathcal{D}_2)\big)$, where $\hat{\alpha}(\Psi^*_{\mathcal{T}_1}, \Psi^*_{\mathcal{T}_2}, \mathcal{D}_j)=\frac{1}{|\mathcal{D}_j|}\sum_{i\in \mathcal{D}_j}\cos\langle \tilde{y}_{1,j}^i, \tilde{y}_{2,j}^i\rangle$ for $j=1,2$ is the average cosine similarity between the centered outputs of the two finetuned models $\Psi^*_{\mathcal{T}_1}$ and $\Psi^*_{\mathcal{T}_2}$ on $\mathcal{D}_j$. Here $\tilde{y}_{l,j}^i=\hat{y}_{l,j}^i-\frac{1}{|\mathcal{D}_j|}\sum_{i\in \mathcal{D}_j}\hat{y}_{l,j}^i$ for $l,j\in\{1,2\}$, where $\hat{y}_{l,j}^i$ is the $i$-th output of the fine-tuned model $\Psi^*_{\mathcal{T}_l}$ on the test set $\mathcal{D}_j$.

Note that to compute $\hat{\alpha}(\Psi^*_{\mathcal{T}_1}, \Psi^*_{\mathcal{T}_2})$, we do not require the availability of extra models or datasets except $\Psi^*_{\mathcal{T}_1}$, $\Psi^*_{\mathcal{T}_2}$, and the test sets $\mathcal{D}_1$ and $\mathcal{D}_2$. We have revised Equation 11 accordingly. We find that the values of $\hat{\alpha}$ for the aligned, irrelevant, and contradictory tasks studied in Figure 1 for Colored-MNIST in our paper are 0.891, 0.164, and -0.849, which is aligned with our formulation of $\alpha>0$, $\alpha\approx 0$, and $\alpha<0$ for the three scenarios. For our language generation tasks, $\hat{\alpha}(\Psi^*_{\mathcal{T}_{HP1}}, \Psi^*_{\mathcal{T}_{HP2}})=0.498$ and $\hat{\alpha}(\Psi^*_{\mathcal{T}_{HP1}}, \Psi^*_{\mathcal{T}_{PP}})=0.239$, which indicates that HP1 is more aligned with HP2 than with PP.

Then, our Theorems 1 and 2 indicate that by trying some large $\lambda>1-\alpha$, the merged model can perform well on both tasks if $\hat{\alpha}\geq 0$. By using some $\lambda\leq 0$, the merged model can forget $\mathcal{T}_2$ while maintaining the performance on $\mathcal{T}_1$ if $\hat{\alpha}\leq 0$.
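
As a rough sketch of this procedure (our own illustration; the array shapes, names, and the way outputs are collected are assumptions rather than the authors' implementation), $\hat{\alpha}$ can be computed from the centered per-example outputs of the two fine-tuned models on the two test sets:

```python
import numpy as np

def alpha_hat(outputs_model1, outputs_model2):
    """Estimate the task correlation alpha from model outputs.

    outputs_modelL[j] is an (n_j, k) array holding the outputs of the
    fine-tuned model Psi*_{T_L} on test set D_j (j = 0, 1).
    Shapes and names are illustrative assumptions."""
    per_set = []
    for j in range(2):
        y1 = outputs_model1[j] - outputs_model1[j].mean(axis=0)  # center over D_j
        y2 = outputs_model2[j] - outputs_model2[j].mean(axis=0)
        cos = np.sum(y1 * y2, axis=1) / (
            np.linalg.norm(y1, axis=1) * np.linalg.norm(y2, axis=1) + 1e-12)
        per_set.append(cos.mean())                               # average cosine over D_j
    return 0.5 * (per_set[0] + per_set[1])

# Rough guideline following the discussion above: if alpha_hat >= 0, try a lambda
# somewhat larger than 1 - alpha_hat for task addition; if alpha_hat <= 0, use
# lambda <= 0 to forget T_2 while maintaining performance on T_1.
```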


Q4: Typos.

A4: Thank you for pointing them out. We have corrected them in the updated submission.

Comment

Dear Reviewer 3a5G,

Thank you for raising the score from 6 to 8. We are encouraged by your support.

Authors

Comment

Dear Reviewers/AC/SAC/PC,

We appreciate the evaluation and suggestions and the discussion with you. We uploaded a revision of our manuscript according to the review. In this revision, in addition to the changes discussed in the previous response, we also added the result of the correlation between language datasets HP1 and HP2 and between HP1 and PP as promised earlier to Reviewers 3a5G and 3aPS in the response and the manuscript. This shows that HP1 is more aligned with HP2 than PP, which verifies our experimental results in Section 4.2.

We summarize major changes in the revision during the rebuttal process.

  1. More discussion and experiments: We have improved the discussion of our language model experiment in lines 507-509 and added the language generation experiment using Phi-3-small (7B) in Appendix A and an illustration of the results of the less-aligned task in footnote 4 as suggested by Reviewer 3aPS. We have added footnote 1 to introduce the empirical motivation of the data formulation and Appendix D to discuss the extension to the multi-classification case as suggested by Reviewer L24R.

  2. Presentation: We have corrected Equation 4, Equation 11, and Definition 2 and included Table 6 of key notations in Appendix B as suggested by Reviewers 3a5G, 3aPS, and QzER.

  3. Typos: We have corrected typos as mentioned by Reviewer 3a5G and 3aPS.

Thanks,

Authors

AC Meta-Review

The submission theoretically analyses task vector arithmetic in transformers, demonstrating the extent to which multi-task learning, unlearning, and out-of-distribution generalisation can be achieved. The theoretical analysis is corroborated by experiments. All reviewers were excited by the theoretical results—the first such results developed in the context of task vectors in transformers. The additional experiments were well received, and the overall presentation of the work was praised. The reviewers did not point out any significant weaknesses in the work.

Additional Comments from the Reviewer Discussion

Due to the generally positive reviews, there was not much discussion required for this paper. The authors' rebuttal was well-received, and several reviewers increased their scores.

Final Decision

Accept (Oral)