Update Your Transformer to the Latest Release: Re-Basin of Task Vectors
We propose a data-free method to transfer fine-tuned Transformer model updates to newer pre-trained versions using weight permutations, enabling fine-tuned models to be updated without retraining.
Abstract
Reviews and Discussion
This paper introduces TransFusion, a method that re-basins task vectors by aligning model weights to base models in the parameter space, aiming at adapting task vectors to a later version of the model. In particular, TransFusion employs a two-step permutation process: inter-head alignment using spectral distance and intra-head matching to align each pair of head matrices. Experiments on vision and NLP tasks show that TransFusion outperforms existing re-basin methods, preserving fine-tuning effects with improved zero-shot generalization.
update after rebuttal
The rebuttal helped address my concerns on functional equivalence, so I raised my score to 3.
Questions for Authors
Please refer to Strengths And Weaknesses
Claims and Evidence
While the paper presents feasible empirical results, some claims lack theoretical guarantees or deeper analysis. For example, it is not explicitly shown how well TransFusion aligns two self-attention layers.
Methods and Evaluation Criteria
While the paper presents promising results, it is unclear whether the proposed two-step permutation process fully preserves the functional equivalence of the self-attention mechanism. In objective functions (8) and (9), each permutation matrix is optimized separately. The approach appears to lack explicit constraints to guarantee that the transformed model is functionally equivalent to the original model. Further clarification or theoretical verification would strengthen the evidence for the effectiveness of TransFusion.
In addition, complexity is an important metric for evaluating the efficiency of re-basin methods [1]. However, this paper does not analyze the complexity of TransFusion, either theoretically or empirically.
[1] Ainsworth, Samuel, Jonathan Hayase, and Siddhartha Srinivasa. "Git Re-Basin: Merging Models modulo Permutation Symmetries." The Eleventh International Conference on Learning Representations.
Theoretical Claims
I checked the correctness of Theorem A.1, i.e., permutation preserves the spectral distance. This theorem provides motivation for choosing the spectral distance to measure the alignment of different attention head matrices.
Experimental Design and Analysis
- The paper does not clearly specify the details of the transformer models used for NLP experiments
- The paper evaluates only one transformer model backbone, limiting the generalizability of the findings. It would be helpful to include more experiments on different architectures to ensure the robustness and scalability of the proposed TransFusion.
Supplementary Material
I reviewed the whole supplementary material.
Relation to Existing Literature
This paper proposes an important application of model re-basin (parameter matching): to adapt task vectors to different versions of the base model. Additionally, it proposes a novel permutation for the self-attention mechanism to align the multi-head attention parameters while preserving the functional equivalence. This can benefit further studies on permutation invariance and mode connectivity of transformers.
Missing Important References
I don't find any missing references.
Other Strengths and Weaknesses
Strengths:
- This study extends parameter matching techniques to transformer models, improving the alignment of self-attention layers by permutation.
- This paper is overall easy to follow and well-organized.
Weaknesses:
- The definition of symbols is a bit confusing: the same symbol is first defined as the number of heads and then re-defined as the head matrix. Additionally, the head matrix should have a subscript (q, k, or v) to indicate the role of the matrix.
- It is mentioned that the proposed head alignment techniques address the head contamination problem. However, it is unclear to me why we need to tackle this issue. In my opinion, if the rows of different head matrices align better than rows in the same head matrix, permuting the rows across different head matrices is acceptable as long as the functional equivalence is preserved.
- It is strange to separately optimize $P_{inter}$ and $P_{intra}$, as the resulting permutation might not preserve the functional equivalence of a self-attention layer, which is a prerequisite for re-basin. It would be better to explicitly show how the joint application of $P_{inter}$ and $P_{intra}$ leads to functional equivalence.
- Although I like the novel problem setting of adapting task vectors to different base models, this new problem setting does not bring distinct technical challenges. The novel part of two-step attention head alignment only solves the head contamination problem (with an unclear motivation to me). It would be helpful to highlight the technical challenges and contributions of TransFusion for better significance.
Other Comments or Suggestions
Please refer to Strengths And Weaknesses
Functional equivalence
The central criticism of the reviewer revolves around the preservation of functional equivalence in self-attention layers. To address this justifiable concern, we provide a proof showing that our two-stage method ensures functional equivalence.
Theorem. Let $P_{inter}$ be a permutation of the attention heads, and let $\{P_{intra}^i\}$ be a set of permutations applied independently within each head. Then, applying $P_{inter}$ followed by the intra-head permutations preserves the functional equivalence of the multi-head self-attention layer.
Notation:
- $X$: input sequence of shape $n \times d_m$
- $W_Q, W_K, W_V$: weight matrices, each of shape $d_m \times d_m$
- $|H|$: number of heads
- $d_h = d_m / |H|$: number of units within each head
The self-attention layer:
- computes the matrices $Q = X W_Q$, $K = X W_K$, $V = X W_V$;
- splits $Q$ into heads $Q_1, Q_2, \ldots, Q_{|H|}$ (the same for $K$ and $V$).
Step 1. The two-stage permutation procedure involving inter- and intra-head swaps can be summarized by a single, composed permutation matrix $P_{attn}$. Such a matrix of shape $d_m \times d_m$ has a notable form, namely a $|H| \times |H|$ grid of block matrices. Each block has shape $d_h \times d_h$ and is either filled with zeros or a valid permutation matrix over the $d_h$ units. Moreover, each block-row and block-column contains exactly one permutation block. An example with three heads:

$$P_{attn} = \begin{pmatrix} P_{intra}^0 & 0 & 0 \\ 0 & 0 & P_{intra}^1 \\ 0 & P_{intra}^2 & 0 \end{pmatrix},$$

with heads swapped according to:

$$P_{inter} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}.$$

For the sake of notation, we will use the permutation vector $\pi$ to denote how $P_{inter}$ swaps heads; in this example, $\pi$ maps heads $(0, 1, 2)$ to $(0, 2, 1)$.
Notably, a block matrix with the features mentioned above can be written as

$$P_{attn} = \sum_{i=1}^{|H|} E_{i,\pi(i)} \otimes P_{intra}^{i},$$

where $\otimes$ is the Kronecker product and $E_{ij}$ is a $|H| \times |H|$ matrix filled with zeros except at position $(i, j)$, where it contains a one.
Step 2. When applying the block matrix $P_{attn}$ to the query weight $W_Q$, we get the following query heads:

$$Q_i' = \sum_{j=1}^{|H|} Q_j \, P_{attn}[j, i] = Q_{\pi^{-1}(i)} \, P_{intra}^{\pi^{-1}(i)},$$

where $P_{attn}[j, i]$ refers to the block at position $(j, i)$ (which can be either zero-filled or not). This relation is relevant, as it entails that the new head $Q_i'$ corresponds to the head designated by the inter-head permutation $\pi$, modified according to the permutation applied to its units. Note that the same result applies to $K$ and $V$.
Step 3: attention. For each head, the attention matrix (after permutation) is:

$$A_i' = \mathrm{softmax}\!\left(\frac{Q_i' {K_i'}^{\top}}{\sqrt{d_h}}\right) = \mathrm{softmax}\!\left(\frac{Q_{\pi^{-1}(i)} \, P_{intra}^{\pi^{-1}(i)} \big(P_{intra}^{\pi^{-1}(i)}\big)^{\top} K_{\pi^{-1}(i)}^{\top}}{\sqrt{d_h}}\right) = A_{\pi^{-1}(i)}.$$

Thanks to the orthogonality of the intra-head permutation blocks, the attention scores are only influenced by the inter-head permutation.
Step 4: output. For each head, the output is:

$$O_i' = A_i' V_i' = A_{\pi^{-1}(i)} \, V_{\pi^{-1}(i)} \, P_{intra}^{\pi^{-1}(i)} = O_{\pi^{-1}(i)} \, P_{intra}^{\pi^{-1}(i)}.$$

The final output is obtained by concatenating all heads:

$$\big[O_1', O_2', \ldots, O_{|H|}'\big] = \Big[\textstyle\sum_{j=1}^{|H|} O_j \, P_{attn}[j, i]\Big]_{i=1}^{|H|} = O \, P_{attn}.$$

Result. To sum up, applying the block permutation $P_{attn}$ to each projection matrix is equivalent to permuting the output of the self-attention mechanism by $P_{attn}$. This demonstrates functional equivalence in the context of our approach. When the output is fed to the next layer, we have to multiply it by $P_{attn}^{\top}$ to recover the original output.
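As a sanity check of the argument above, here is a minimal NumPy sketch (our illustration, not the authors' code) that composes $P_{attn}$ from a random inter-head permutation and random intra-head permutations, and verifies numerically that permuting $W_Q$, $W_K$, $W_V$ by $P_{attn}$ permutes the attention output by $P_{attn}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, d_h = 5, 3, 4                 # tokens, heads, units per head (toy sizes)
d = h * d_h

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mhsa(X, Wq, Wk, Wv):
    """Multi-head self-attention (concatenated heads, no output projection)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outs = []
    for i in range(h):
        s = slice(i * d_h, (i + 1) * d_h)
        A = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_h))
        outs.append(A @ V[:, s])
    return np.concatenate(outs, axis=1)

def random_perm(k):
    return np.eye(k)[rng.permutation(k)]

# Compose P_attn = sum_i E_{i, pi(i)} (x) P_intra^i, as described above.
pi = rng.permutation(h)                         # inter-head permutation
P_attn = np.zeros((d, d))
for i in range(h):
    E = np.zeros((h, h))
    E[i, pi[i]] = 1.0
    P_attn += np.kron(E, random_perm(d_h))      # intra-head permutation in block (i, pi(i))

X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

out = mhsa(X, Wq, Wk, Wv)
out_perm = mhsa(X, Wq @ P_attn, Wk @ P_attn, Wv @ P_attn)
assert np.allclose(out_perm, out @ P_attn)      # permuted weights <=> permuted output
```

Feeding `out_perm @ P_attn.T` to the next layer recovers the original output, matching the Result above.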
In my opinion, if the rows of different head matrices align better than rows in the same head matrix, permuting the rows across different head matrices is acceptable as long as the functional equivalence is preserved.
Due to space constraints, a detailed discussion is not possible. It is important to note that permuting rows across heads means that the blocks within $P_{attn}$ at each head level are not necessarily orthogonal. Hence, the associated term does not cancel out when computing the input to the softmax function. This creates a challenge in reversing the effects of the permutation, which is necessary to ensure that functional equivalence is maintained after applying the softmax.
Other aspects
For details on the backbones used in our NLP experiments, as well as additional tests across different model sizes and a complexity analysis, please see our response to yjkC.
This paper addresses a critical challenge for foundation models, i.e., fine-tuned models becoming obsolete when their base models are updated. The authors introduce TransFusion, a data-free method to transfer task vectors from an old base model to a new one. By leveraging a structured permutation strategy tailored for Transformers, i.e., addressing multi-head attention alignment and residual connections through spectral theory, the method ensures functional equivalence while preserving generalization. Extensive experiments on vision (e.g., EuroSAT, SVHN) and NLP tasks (e.g., GLUE) demonstrate significant improvements over baselines including Git Re-Basin, validating the effectiveness and robustness of the proposed approach.
Questions for Authors
1. Could the method handle minor architectural changes (e.g., added layers) via adaptive permutations?
2. What do the limited gains on DTD mean? Are they related to task vector quality or dataset characteristics?
3. How does the computational complexity scale with model size (e.g., ViT-B vs. ViT-L)?
Claims and Evidence
The core claims are well-supported. As shown in Table 1, the proposed method consistently outperforms baselines in zero-shot task accuracy (e.g., +4.95% on EuroSAT, Table 1) while maintaining generalization (e.g., minimal drops on ImageNet-R). Figure 5 and Table 3 also show the advantages of using Functional Equivalence and Transformer-Specific Design.
Methods and Evaluation Criteria
The proposed methods are reasonable, building upon established neural network permutation invariances and extending them thoughtfully to Transformer architectures. Their innovative two-step spectral-based attention head alignment and residual connection permutation strategy effectively address previously unresolved challenges (head contamination and residual mismatch). The chosen evaluation metrics, including zero-shot accuracy on downstream tasks and generalization to support sets, are appropriate and directly relevant to assessing the effectiveness and practical applicability of the proposed solution.
Theoretical Claims
The authors provide rigorous theoretical analyses, particularly regarding their proposed permutation-invariant spectral distance metric for attention head alignment. Their mathematical proof in Appendix A.1 confirms that singular value-based distance metrics are indeed invariant to permutations, logically and rigorously justifying the attention alignment strategy. Additionally, their handling of residual connections is clearly explained and mathematically sound, ensuring consistency across permutations and preserving functional equivalence.
Experimental Design and Analysis
Experimental design is thorough, comprehensive, and rigorous. Evaluations span multiple tasks, including image classification and NLP benchmarks, demonstrating cross-modal applicability. Baselines (e.g., Git Re-Basin, Optimal Transport, and vanilla transport) are effectively selected to illustrate clear performance distinctions. The results consistently show substantial improvements from TransFusion, with sufficient statistical robustness inferred from the consistently large margins observed.
Supplementary Material
Supplementary materials substantially enhance the paper's credibility and completeness. Appendix A.1 provides critical theoretical proofs supporting spectral invariance, while Appendix A.2 details residual connection permutation implementation, facilitating reproducibility and deeper theoretical understanding. The additional sensitivity analyses in Appendix A.3 further confirm the robustness and generality of the proposed method.
Relation to Existing Literature
The paper is well-connected to existing literature. The following works are also suggested for discussion:
[1] Zhou et al. DR-Tune: Improving Fine-tuning of Pretrained Visual Models by Distribution Regularization with Semantic Calibration. In ICCV, 2023.
[2] Yamaguchi et al. Adaptive Random Feature Regularization on Fine-tuning Deep Neural Networks. In CVPR, 2023.
Missing Important References
No critical missing references were identified. The existing cited literature covers the majority of essential works thoroughly, and any minor omissions do not detract from understanding or evaluating the paper’s core contributions.
Other Strengths and Weaknesses
Strengths:
+ This paper proposes a novel solution to Transformer re-basin with theoretical guarantees.
+ The proposed method enables cost-effective model updates without data access.
+ This paper provides strong empirical validation across modalities and tasks.
Weaknesses:
- The proposed method requires identical model architectures. Will slight changes (e.g., in layer counts) deteriorate the alignment?
- How well does the permutation alignment scale to large models?
- How does the quality of the task vectors affect the performance of the proposed method?
Other Comments or Suggestions
No.
Complexity analysis
To assess how the computational complexity scales with model size, we define:
- $|L|$: number of layers, evenly divided into MLP ($|L|/2$) and self-attention ($|L|/2$) layers.
- $h$: number of attention heads.
- Each MLP layer contains two linear projections with dimensions $d_m$ and $d_{ff}$. We assume $d_{ff} = d_m$ for simplicity.
- Self-attention layers have Q, K, and V matrices, each of dimensions $d_m \times d_m$; each head thus covers $d_h = d_m / h$ units.
We now estimate the complexity for MLP and self-attention layers for a single iteration of the weight-matching algorithm.
MLP Layers
Permutation alignment for MLP layers resembles Git Re-Basin. The main computational cost involves computing a matrix of pairwise dot products between the rows of the projection matrices, with complexity $O(d_m^3)$. Subsequently, applying the Hungarian algorithm to a $d_m \times d_m$ matrix also has complexity $O(d_m^3)$. Thus, each MLP layer incurs $O(d_m^3)$.
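For concreteness, a minimal sketch of one such matching step (assuming SciPy's Hungarian solver; `match_rows` is an illustrative helper of ours, not the paper's code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_rows(W_A: np.ndarray, W_B: np.ndarray) -> np.ndarray:
    """Find the row permutation of W_B that best aligns it with W_A."""
    cost = W_A @ W_B.T                                    # pairwise row similarities, O(d_m^3)
    _, perm = linear_sum_assignment(cost, maximize=True)  # Hungarian algorithm, O(d_m^3)
    return perm                                           # perm[i]: row of W_B matched to row i of W_A
```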
Self-Attention Layers
The analysis is split into two steps: inter-head and intra-head permutations.
Inter-head permutation:
- Computing singular value decompositions (SVDs) for the Q, K, V matrices across the two networks (A, B) and all heads results in $3 \cdot 2 \cdot h = 6h$ SVDs. Each SVD on a head-level matrix of size $d_h \times d_m$ has complexity $O(d_h^2 d_m)$. Hence, the total complexity for all SVDs is $O(h \, d_h^2 d_m)$.
- Computing the distance matrices for Q, K, V, each of size $h \times h$, involves $O(d_h)$ operations per element (comparing spectra of length $d_h$); hence the complexity is $O(h^2 d_h)$.
- Applying the Hungarian algorithm to the resulting $h \times h$ matrix incurs complexity $O(h^3)$. Thus, the inter-head permutation complexity is $O(h \, d_h^2 d_m + h^2 d_h + h^3)$.
Intra-head permutation:
- Computing the similarity matrices for intra-head permutations involves cost matrices of dimensions $d_h \times d_h$, incurring a per-head complexity of $O(d_h^2 d_m)$.
- Applying the Hungarian algorithm separately for each head results in a per-head computational complexity of $O(d_h^3)$. Summing across all heads, the total intra-head permutation complexity is $O(h \, (d_h^2 d_m + d_h^3))$.
Therefore, the total complexity per self-attention layer is $O(h \, d_h^2 d_m + h^2 d_h + h^3 + h \, d_h^3)$.
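A hedged sketch of the two-step head matching whose costs are tallied above (the exact spectral metric in the paper may differ; here the l2 distance between sorted singular-value spectra stands in for it, and the function names are ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def spectral_distance(Wa: np.ndarray, Wb: np.ndarray) -> float:
    # Distance between sorted singular-value spectra of two head matrices;
    # invariant to row and column permutations of either matrix.
    sa = np.linalg.svd(Wa, compute_uv=False)
    sb = np.linalg.svd(Wb, compute_uv=False)
    return float(np.linalg.norm(sa - sb))

def match_heads(heads_A, heads_B):
    # Inter-head step: h x h spectral cost matrix solved with the Hungarian algorithm.
    h = len(heads_A)
    cost = np.array([[spectral_distance(heads_A[i], heads_B[j]) for j in range(h)]
                     for i in range(h)])
    _, pi = linear_sum_assignment(cost)      # minimise total spectral distance
    return pi                                # pi[i]: head of B matched to head i of A

def match_units(Ha, Hb):
    # Intra-head step: match the d_h rows of two already-paired head matrices.
    cost = Ha @ Hb.T                         # d_h x d_h row similarities
    _, p = linear_sum_assignment(cost, maximize=True)
    return p

# Toy usage: four heads of shape (d_h=8) x (d_m=32) from two hypothetical checkpoints.
rng = np.random.default_rng(0)
heads_A = [rng.standard_normal((8, 32)) for _ in range(4)]
heads_B = [rng.standard_normal((8, 32)) for _ in range(4)]
pi = match_heads(heads_A, heads_B)
intra = [match_units(heads_A[i], heads_B[pi[i]]) for i in range(4)]
```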
Overall Complexity
Considering all layers, the total complexity combines the contributions from MLP and self-attention layers, yielding:
$$O\!\left(\tfrac{|L|}{2} \, d_m^3 \;+\; \tfrac{|L|}{2} \big(h \, d_h^2 d_m + h^2 d_h + h^3 + h \, d_h^3\big)\right).$$
This complexity scales polynomially, dominated by the terms involving $d_m$ and $|L|$. The method remains substantially more efficient than full retraining, avoiding costly gradient computations or data iterations.
Aligning models with different architectures
The idea of aligning models with minor variations in architecture, such as different layer counts, is worth exploring. To address this, one simple approach could involve selectively pruning layers from the model with more layers to match its counterpart. For instance, one could remove redundant and unimportant layers. Alternatively, one could adopt the opposite strategy and replicate the last block of the smaller network multiple times to achieve a match. We plan to explore these aspects further in future work.
Related works
Both our work and that mentioned by the reviewer aim to improve transfer during fine-tuning. However, while DR-Tune and AdaRand assume access to downstream task data during fine-tuning, our method requires no training data to transfer fine-tuning from one model to another. Moreover, these solutions employ data-driven regularization during optimization; instead, our approach uses a parameter alignment technique that eliminates the need for training steps.
Quality of task vectors
... How does the quality of the task vectors affect the performance of the proposed method? ... What does the limited gains on DTD mean? Is it related to task vector quality or dataset characteristics?
Our approach follows a twofold procedure: first, align the two base models; then, transport the task vector. Regarding the quality of the task vector, the first stage is unaffected, as it depends solely on the two base checkpoints. The second step, instead, employs the task vector itself and thus could be affected by low-quality fine-tuning (after all, garbage in, garbage out). For instance, when considering DTD, we argue that the lower quality of the task vector is directly attributable to the challenging characteristics of the dataset, which features only 40 examples per class on average (for comparison, the other datasets we use have at least around 1000 examples per class).
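To make this twofold procedure concrete, here is a hedged single-matrix sketch (generic re-basin-style permutation application in the spirit of Git Re-Basin; the names `theta_A_ft`, `P_out`, `P_in` are illustrative and not the paper's notation):

```python
import numpy as np

def transport_task_vector(theta_A_ft, theta_A, theta_B, P_out, P_in):
    """Transfer a fine-tuning update from the old base model to the new one,
    for a single weight matrix, given permutations found by weight matching."""
    tau_A = theta_A_ft - theta_A          # task vector of the old release
    tau_pi = P_out @ tau_A @ P_in.T       # re-basin: express it in the new model's basis
    return theta_B + tau_pi               # updated fine-tuned weights, no retraining
```

The permutations are computed from the two base models only; the task vector enters just when it is transported and added, which is where low-quality fine-tuning would show up.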
This paper explores how to update a fine-tuned downstream model from an older version of a pre-trained model to a newer pre-trained model without requiring re-finetuning. The paper proposes the TransFusion method, which incorporates attention head matching, alignment, and residual connection handling. Experiments on CV and NLP tasks demonstrate the effectiveness of the proposed approach.
Post-rebuttal
I will maintain my score of 4.
Questions for Authors
see above
Claims and Evidence
The contributions claimed in the paper are supported by the experimental results.
Methods and Evaluation Criteria
The proposed method meets the requirements of the intended application scenario.
Theoretical Claims
The theoretical proofs in the paper seem correct, but since I am not a researcher in this specific field, I cannot be completely certain.
Experimental Design and Analysis
The experiments are designed with pretrained models and datasets from both the language and vision domains, demonstrating the effectiveness of the proposed method.
Supplementary Material
The appendix includes proofs and additional experimental results.
Relation to Existing Literature
If the pre-training and fine-tuning processes are viewed as two different tasks, the problem being studied may be related to continual learning, which aims to mitigate catastrophic forgetting when learning new knowledge.
Missing Important References
N/A
Other Strengths and Weaknesses
Strengths:
- The problem studied in the paper is practically meaningful, and the motivation behind the proposed method is clearly articulated.
- The proposed method is well-supported by theoretical foundations.
Weaknesses:
- Some experimental details are not clearly specified. For example, which pre-trained models were used in the NLP experiments?
- The paper only conducts experiments on classification tasks. However, the problem it addresses is clearly more relevant to large language models. Can this method be applied to models at the 7B scale?
Other Comments or Suggestions
N/A
Backbones used in NLP experiments
... For example, which pre-trained models were used in the NLP experiments?
We sincerely apologize for the omission. In our experiments, we used two variants of OpenCLIP's ViT-B-16 text encoder: "ViT-B-16/commonpool-l-s1b-b8k" for one base model and "ViT-B-16/datacomp-l-s1b-b8k" for the other. We specifically chose these versions over models like BERT or T5 because we needed two variants of the same base model, trained in different ways, to represent the old and the updated pre-trained checkpoints.
Application to LLMs
... The paper only conducts experiments on classification tasks. However, the problem it addresses is clearly more relevant to large language models. Can this method be applied to models at the 7B scale?
Our method can be seamlessly applied to scenarios beyond classification (e.g., detection, segmentation, VQA). This flexibility arises because our approach neither relies on classification-specific techniques nor imposes any assumptions regarding the type of loss function used during optimization.
Regarding the application to 7B models such as LLaMA 2 and Mistral 7B, this represents an intriguing area that we plan to explore in future work. We consider it feasible; indeed, as discussed in the response to Reviewer gzgf, the complexity of weight matching scales linearly with the number of layers. We also advocate for restricting fine-tuning operations to the self-attention layers only, for efficiency reasons. Indeed, our method exhibits appealing computational complexity for these layers, as noted again in the response to Reviewer gzgf. Notably, this selective fine-tuning approach aligns conceptually with established state-of-the-art techniques such as QLoRA [a], which fine-tune only the self-attention layers while excluding feed-forward layers. Consequently, re-basin operations could similarly be confined exclusively to the self-attention layers, given that no information transfer is necessary for the remaining layers (as the task vector would be null).
[a]: Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs, NeurIPS 2023.
Thank you for the author’s efforts. I have no further questions. If no other reviewers raise strong objections in the subsequent discussion, I will maintain my score of 4.
The paper introduces a new re-basin method for keeping models up-to-date as their underlying pre-trained backbones evolve, focusing on Transformers in particular. The authors' method transports the fine-tuning modifications, captured as a task vector, from the older base model to a new checkpoint without re-training or using data. The paper introduces TransFusion, a data-free and training-free re-basin procedure that realigns the task vector so it can be applied to a new model release. The paper applies the method to vision and NLP tasks with ViT-B and different NLP classifiers.
Questions for Authors
N/A
Claims and Evidence
The core concerns I have about the claims of the paper are as follows:
- Assumption of structural similarity of models: One core assumption of the method is that the new checkpoint is largely similar to the original one. Is this always true? For instance, if the new checkpoint is trained under different conditions, such as a new data distribution or slight architectural tweaks, then the permutation alignment may not capture the necessary transformations. Moreover, when the representation spaces of the two networks diverge, the transport of the task vector could be suboptimal or even detrimental.
- Scalability/Computational Overhead: The procedure requires solving a series of linear assignment problems for every transformer layer and for each attention head pair. Although the Hungarian algorithm is polynomial in complexity, the cumulative cost might become significant for very deep or wide models.
- Hyperparameter Tuning: Could the authors discuss more how to choose the task-vector scaling coefficient in practice? I worry that sensitivity may make this difficult to tune.
- Disadvantages of a data free transfer: One major disadvantage and limitation I would have liked to see discussed is about cases where domain-specific adjustments are necessary. In many real-world cases, even a minimal dataset could help fine-tune the alignment further, suggesting that a semi-supervised or data-augmented variant might offer better performance. It would be helpful for the authors to discuss when we should use TransFusion.
Methods and Evaluation Criteria
I see no issues with the methods and evaluation criteria. The authors evaluate on well-known NLP tasks, which seem reasonable for this paper, although I would have liked to see pretrained decoder models. I would be curious about evaluating on a dataset like GSM after fine-tuning, to see the capabilities of the model on a dataset that is (reasonably) potentially out of distribution.
Theoretical Claims
I checked proofs in the appendix and didn't see any issues. However, I do have one concern about theoretical aspects of this work.
- Spectral Metric for Head Matching: While using singular values to form a permutation-invariant distance seems justified, it's not clear if this metric always aligns heads based on functional roles. Two heads with similar singular value profiles might perform very different roles in the network. This could lead to cases where heads that are similar in “energy” are paired, but the semantic content of their learned representations is not matched, potentially reducing the efficacy of the transport.
- Residual Connections: The method redefines the identity mapping in residual connections. To me, this may be sensitive to the specific ordering and scale of the permutation matrices. Also, this keeps the method from applying to other, non-Transformer-based architectures.
Experimental Design and Analysis
- The authors should expand on which encoder language model they use (BERT, T5, etc.).
- It would be great for the authors to discuss how their method scales to deeper and wider models. If the authors could run different model sizes, that would be very useful for contextualizing their approach.
Supplementary Material
No supplementary material was uploaded.
Relation to Existing Literature
The most significant paper that is similar to this approach is obviously Git Rebasin (ICLR 2023). The authors cite and discuss this paper at length to explain the similarities and differences of their approach. The authors also discuss task arithmetic and weight interpolation at length, as this is relevant to their approach.
Missing Important References
None that I can find.
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
Clarification
We clarify that our paper does not assume similarity between the two base checkpoints. As stated in Sec. 1 (line 67) and Sec. 3, these checkpoints may result from training on distinct data distributions or with different techniques. The re-basin mechanism is indeed designed to align models with significant differences in parameter space.
Using a small dataset to enhance transport
We agree; if even a small amount of data were available for each class, it would certainly be beneficial to use it. However, there are real-world scenarios where retaining data is not possible. That said, if these constraints do not apply, our method can be effectively utilized in conjunction with fine-tuning. To show this, starting from a small subset of shots per class, we follow [b] and learn one scaling coefficient per layer, $\alpha_1, \ldots, \alpha_{|L|}$. We then examine the performance after fine-tuning these coefficients.
| Method | EuroSAT | DTD | GTSRB | SVHN |
|---|---|---|---|---|
| Fine-tuning w/o re-basin (no permutation) | +7.93 | -1.44 | +4.70 | -15.98 |
| Fine-tuning after re-basin (ours) | +10.00 | +1.21 | +6.80 | +10.52 |
There is a significant gain when fine-tuning a model that has undergone re-basin using our approach (second row). In contrast, fine-tuning the naively transported model, which involves no permutation (first row), produces inferior results. This underscores that re-basin and fine-tuning should not be viewed as mutually exclusive but as complementary strategies.
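As a hedged illustration of the per-layer coefficient fine-tuning described above (toy weights and synthetic data; names such as `theta_B` and `tau_pi` are ours and do not come from the paper), a minimal PyTorch sketch:

```python
import torch

torch.manual_seed(0)

# Toy stand-ins: base weights of the new release and a (permuted) task vector.
theta_B = {"l1": torch.randn(16, 8), "l2": torch.randn(2, 16)}
tau_pi  = {"l1": 0.1 * torch.randn(16, 8), "l2": 0.1 * torch.randn(2, 16)}

def forward(weights, x):
    # Tiny two-layer classifier built directly from the weight dictionary.
    return torch.relu(x @ weights["l1"].T) @ weights["l2"].T

# A handful of labelled shots (synthetic here).
x, y = torch.randn(32, 8), torch.randint(0, 2, (32,))

# One learnable scaling coefficient per layer, initialised at 1.
alphas = torch.nn.Parameter(torch.ones(len(theta_B)))
opt = torch.optim.Adam([alphas], lr=1e-2)

for _ in range(100):
    merged = {k: theta_B[k] + a * tau_pi[k] for a, k in zip(alphas, theta_B)}
    loss = torch.nn.functional.cross_entropy(forward(merged, x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```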
[b] Knowledge composition using task vectors with learned anisotropic scaling. NeurIPS 2024.
More experiments on different model sizes
We repeated the tests of Tab. 1 using two CLIP ViT-L/14 models, pretrained on "laion400m_e31" and on "commonpool_xl_clip_s13b_b90k" respectively, as the two base checkpoints.
| Method | EuroSAT | EuroSAT-Supp | DTD | DTD-Supp | GTSRB | GTSRB-Supp | SVHN | SVHN-Supp |
|---|---|---|---|---|---|---|---|---|
| | 67.37 | 88.66 | 64.94 | 88.66 | 60.15 | 88.66 | 68.53 | 88.66 |
| | -4.07 | -0.33 | +0.74 | -0.36 | +0.70 | -0.93 | +2.04 | -3.73 |
| Ours | +1.15 | -0.01 | +0.37 | -0.20 | +2.29 | -0.13 | +5.33 | -0.73 |
The results align with those reported in the paper.
Computational Overhead
For an extensive complexity analysis, please refer to our response to gzgf. We highlight two additional aspects:
- By separating multi-head attention alignment into inter- and intra-head steps, we significantly reduce scaling issues related to width. For example, in CLIP's ViT-B/16 (12 heads), the search space shrinks from $d_m!$ candidate row permutations to $h! \cdot (d_h!)^{h}$. For a ViT-L/14, Git Re-Basin's space grows further with the larger hidden dimension, while TransFusion's remains much smaller (a rough size comparison is sketched after this list).
- The weight-matching procedure depends only on and , and not on the downstream tasks to be transferred. This ensures that associated costs can be incurred only once. In practice, the entity responsible for releasing the updated model can precompute and distribute the permutation matrices, eliminating the computational burden for end users.
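A small sketch of the size comparison mentioned above (the widths and head counts are assumed, standard ViT values; the resulting figures are ours rather than the paper's):

```python
import math

def log10_factorial(n: int) -> float:
    # log10(n!) via lgamma, avoiding overflow for large n.
    return math.lgamma(n + 1) / math.log(10)

def search_space_sizes(d_m: int, heads: int) -> tuple[float, float]:
    """log10 of the number of candidate permutations: unconstrained row matching
    (d_m!) versus the structured inter-/intra-head space (h! * (d_h!)^h)."""
    d_h = d_m // heads
    return log10_factorial(d_m), log10_factorial(heads) + heads * log10_factorial(d_h)

# Assumed dimensions for illustration: ViT-B/16 (width 768, 12 heads),
# ViT-L/14 (width 1024, 16 heads).
for name, d_m, h in [("ViT-B/16", 768, 12), ("ViT-L/14", 1024, 16)]:
    full, structured = search_space_sizes(d_m, h)
    print(f"{name}: ~10^{full:.0f} unconstrained vs ~10^{structured:.0f} structured")
```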
Functional role
Intuitively, encoding semantic similarity would require optimizing the permutations based on the activations of each layer, which is hard in a data-free scenario like ours. However, if a small amount of data is available, one can leverage our decoupled inter- and intra-head approach and replace our energy-based metric with one based on activation discrepancies.
Nevertheless, while singular values do not directly encode high-level semantics, they capture structural properties. The largest singular value of a linear transformation corresponds to its Lipschitz constant; hence, aligning attention heads with similar distributions of singular values implies aligning heads with comparable sensitivity to input perturbations. Moreover, the spectrum of singular values indicates how information is propagated: a low-rank attention head typically focuses on a narrow subspace of the input, whereas a full-rank head affects a broader range of directions.
Hyperparameters
Regarding the task vector, it is optionally scaled by a hyperparameter that we kept fixed at a single default value for all experiments, due to our assumption of no access to data, which precluded further tuning. To assess sensitivity to this coefficient, we kindly refer to Fig. 4 (main paper): applying TransFusion leads to increased robustness across various choices of the coefficient.
Residual connections
We remark that our approach to handling residual connections is not specific to Transformers but is applicable to any residual block. Indeed, residual connections assume an implicit identity mapping on the skip path, i.e., the shortcut multiplies its input by the identity matrix. However, once the permutations are applied to all layers, this assumption no longer holds: the skip path would need to carry the corresponding permutation rather than the identity. As shown in Fig. 5, this deviation is sufficient to break the functional equivalence of the model.
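A tiny numerical illustration of this point (a toy residual block of our own, not the paper's code): permuting only the linear branch breaks equivalence, whereas carrying the permutation on the skip path restores it.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)
W = rng.standard_normal((d, d))
P = np.eye(d)[rng.permutation(d)]            # permutation applied to this block's outputs

def residual(x, W):
    return x + W @ x                         # skip path is an implicit identity mapping

original = residual(x, W)
branch_only = x + (P @ W) @ x                # permute the branch, keep the identity skip
with_skip = P @ x + (P @ W) @ x              # the skip path carries P as well

assert np.allclose(with_skip, P @ original)          # equivalence up to the permutation
assert not np.allclose(branch_only, P @ original)    # identity skip breaks it
```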
Backbones for NLP tests
We apologize for the omission; please refer to our response to S7w3.
Thanks for the detailed rebuttal. It would be great if the authors are able to incorporate these additional results into the paper. Since I have no further concerns, I will increase my score.
The paper investigates the question of whether it is possible to "upgrade" the fine-tuning of a base model to a new, upgraded base model without having to fine-tune again, by relying instead on task vectors. The authors show that this is possible by adapting ideas from model re-basin and improving them specifically for Transformers by carefully handling attention heads and residual connections, showing promising empirical results on vision and language tasks.
The reviewers all agreed that the problem and proposed solution are of significant interest and I thus recommend acceptance.