Initializing Variable-sized Vision Transformers from Learngene with Learnable Transformation
Abstract
Reviews and Discussion
This work proposes Learnable Transformations (LeTra) for improving learngene-based model initialization. In particular, a set of width transformations is learned to produce weight matrices of varying dimensions, and a set of depth transformations is learned to change the number of layers in the model. In the learngene learning stage, an auxiliary model is distilled from a large model, so that the learngene and the transformations can be optimized. After that, a set of variable-sized models can be initialized based on the optimized learngene and the transformations, which can serve as good starting points for efficient fine-tuning. Experiments on various image classification datasets including ImageNet and CIFAR with the Vision Transformer (ViT) architecture demonstrate the efficacy of the proposed method.
Strengths
-
It is intuitive to apply both depth transformations and width transformations to learngene when constructing new models. This work shows the benefits of combining both transformations.
-
The empirical results on various image datasets, including ImageNet, show that LeTra can provide strong model initialization weights, significantly outperforming the prior methods TLEG and Grad-LG.
Weaknesses
-
[Presentation] The description of the proposed approach is not very clear:
-
Overall, if we consider the ViT weights as a set of matrices, each width transformation linearly transforms the rows and columns of a weight matrix, and the depth transformations construct new layers by linearly combining layers from the learngene (see the sketch after this list). Section 3 could be improved by highlighting the core idea behind the two types of transformations and simplifying the notation.
-
Section 3.3 is not easy to follow because the description is a bit vague. For instance, it is not specified how to choose the start and step size in "step-wise selection." The difference between "random selection" and "random selection (wo)" is also unclear.
-
The caption of Figure 3 could include more details.
-
[Models larger than Aux-Net] The proposed approach learns width and depth transformations during Stage 1. However, such transformations cannot "extrapolate," or in other words, we cannot build descendant models that are deeper or wider than the Aux-Net. This somehow limits the application scenarios.
-
[Learngene training costs] When comparing LeTra with baselines, it would be helpful to also include the training costs of Stage 1 in which a more complex set of learngene + learnable transformations is optimized.
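To make this concrete, here is a minimal sketch of how I read the two transformations (all names, shapes, and the scalar depth coefficients are my own illustration, not the authors' code):

```python
import torch

# Hypothetical shapes for illustration only.
D_in_lg, D_out_lg = 192, 192      # learngene layer width
D_in_des, D_out_des = 384, 384    # descendant layer width
L_lg, L_des = 6, 12               # number of learngene / descendant layers

# Learngene weight matrices.
W_lg = [torch.randn(D_in_lg, D_out_lg) for _ in range(L_lg)]

# Width transformation: linearly transform the rows and columns of a learngene
# matrix, e.g. W_des = F_in @ W_lg @ F_out^T.
F_in = torch.randn(D_in_des, D_in_lg)
F_out = torch.randn(D_out_des, D_out_lg)
W_widened = F_in @ W_lg[0] @ F_out.T   # shape: (D_in_des, D_out_des)

# Depth transformation: build each new layer as a linear combination of
# learngene layers (scalar coefficients here are purely illustrative).
coeff = torch.randn(L_des, L_lg)
W_deepened = [sum(coeff[i, j] * W_lg[j] for j in range(L_lg)) for i in range(L_des)]
```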
Questions
- [Transformation selection] In addition to the selection strategies introduced in Section 3.3, there could be other heuristics that consider the importance of each weight row/column, and select them accordingly. For instance, would it be helpful to rank the weights by magnitude and select the largest rows/columns?
Limitations
The authors have adequately discussed the limitations and potential societal impacts.
[Presentation]
1) The description of transformations and notations.
Thank you for pointing this out!
In the revision, we will simplify the notations of Section 3 and highlight the core idea behind the two types of transformations.
2) How to choose the start and step size in "step-wise selection".
In practice, we usually choose the starting row as 1 and calculate the step sizes from the ratio between the Aux-Net and Des-Net dimensions.
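For illustration, a minimal sketch of step-wise row selection, assuming the step size is the floor of the Aux-Net/Des-Net dimension ratio (the helper name and shapes below are hypothetical):

```python
import torch

def stepwise_select(F_aux: torch.Tensor, d_des: int) -> torch.Tensor:
    """Select d_des rows from a trained transformation matrix with a fixed stride.

    Illustrative sketch: start at the first row and use step = floor(d_aux / d_des);
    the exact rule in our implementation may differ slightly.
    """
    d_aux = F_aux.shape[0]
    step = max(d_aux // d_des, 1)
    idx = torch.arange(0, step * d_des, step)[:d_des]
    return F_aux[idx]

# Example: shrink a 768-row transformation matrix to 384 rows.
F_des = stepwise_select(torch.randn(768, 192), 384)   # shape: (384, 192)
```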
3) Difference between "random selection" and "random selection (wo)".
For random selection, we first randomly generate a set of indices, based on which we select the corresponding rows/columns. During the selection process, we ensure that the same rows/columns are selected for all transformation matrices.
In the case of random selection (wo), as opposed to random selection, we select distinct rows/columns for each transformation matrix (L238-239).
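A minimal sketch of the difference between the two strategies (the helper below is hypothetical and only illustrates shared versus per-matrix indices):

```python
import torch

def random_select(mats, d_des, shared=True):
    """Select d_des rows from each transformation matrix (illustrative sketch).

    shared=True  -> "random selection": one index set reused for every matrix.
    shared=False -> "random selection (wo)": a fresh index set per matrix.
    """
    if shared:
        idx = torch.randperm(mats[0].shape[0])[:d_des]
        return [m[idx] for m in mats]
    return [m[torch.randperm(m.shape[0])[:d_des]] for m in mats]
```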
4) Caption of Figure 3.
In the revision, we will add more descriptions in the caption of Figure 3 to avoid any misinterpretation.
[Models larger than Aux-Net]
Please see the relevant discussions in G1 of General response.
[Comparisons between LeTra and baselines in training costs]
Thank you for pointing this out!
Compared to Scratch-1, LeTra reduces total training costs by around 3.6× (12×100 epochs versus 320+12×1 epochs), where "320" means the training epochs of stage 1.
Additionally, in comparison to Scratch-1, we calculate GPU hours for training 12 Des-Nets and find that LeTra significantly reduces total training GPU hours by approximately 7× (1322 GPU hours for Scratch-1 versus 190 GPU hours for LeTra).
[More choices of transformation selection]
Thank you for your insightful question!
We rank the weights by magnitude and select the largest rows/columns as the target transformation matrices, a strategy we refer to as "rank". According to the table below, we observe that "rank" achieves performance comparable to LeTra (continuous selection), thereby validating the robustness of our trained transformation matrices.
| Model | Params(M) | FLOPs(G) | Scratch (100ep) | LeTra | step-wise | rank |
|---|---|---|---|---|---|---|
| Des-H8-L13 | 42.0 | 8.6 | 78.1 | 80.0 | 79.6 | 79.8 |
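For reference, a minimal sketch of this magnitude-based "rank" selection (illustrative only; the per-row L1 norm is used here as the ranking criterion, which may differ slightly from the exact criterion used in the experiment above):

```python
import torch

def rank_select(F_aux: torch.Tensor, d_des: int) -> torch.Tensor:
    """Keep the d_des rows of F_aux with the largest magnitude (illustrative sketch)."""
    scores = F_aux.abs().sum(dim=1)                  # per-row L1 magnitude
    idx = scores.topk(d_des).indices.sort().values   # keep original row order
    return F_aux[idx]
```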
Dear Reviewer p972,
Thank you very much for your constructive comments on our work.
We have made every effort to address the concerns raised. If any aspect of our response is unclear or requires further clarification, please let us know. If everything is clear, we kindly ask if you could consider improving the score.
Thank you very much for the clarification and the new insights on different model sizes. Most of my previous concerns are addressed. I would like to raise my rating.
Thank you very much for raising the rating!
We'd like to express our sincere gratitude for your thoughtful and thorough review of our paper.
This work builds on top of a learning paradigm called Learngene (introduced in an earlier work), which focuses on providing effective initializations for training multiple target models of different sizes. In this paradigm, a compact module, referred to as learngene, is first learned from a large well-trained network, and then the learngene is transformed to initialize different models of varying sizes. This work introduces a set of learnable transformation parameters into the learngene paradigm and trains the compact learngene module and the transformation parameters simultaneously. The learngene module and transformation parameters are trained by using them to create an auxiliary network, which is trained via distillation from a large well-trained model. Once trained, the transformation parameters can be used to transform the learngene module parameters in order to initialize target networks of different sizes.
Experiments are conducted using ImageNet and multiple other transfer learning datasets and the results show that the proposed approach improves training compute efficiency of target models significantly when compared to training from scratch.
Strengths
The paper focuses on the problem of obtaining effective target model initializations cheaply (with less compute). This is a practically useful problem.
The experimental results show that the proposed initialization significantly improves training efficiency when compared to training from scratch.
Weaknesses
Presentation: Presentation of the paper can be improved.
-
Overall, I feel that Sec 3.1 has a lot of notation, not all of which is strictly necessary to describe the method clearly. I encourage the authors to think about simplifying this section.
-
Fig. 3(a) is confusing: Once we have chosen F_{l,in}^des and F_{l,out}^des (by selecting appropriate rows from F_{l,in} and F_{l,out}), the multiplication of three matrices F_{l,in}^des x W_l^LG x (F_{l,out}^des)^T should give the weights for the destination network layer. I do not understand why there are additional insert steps in Fig. 3(a). These steps do not seem to match with Eq. (1). Which process is actually used when performing width modifications? Is it Eq.(1) or the process depicted in Fig 3(a)? If it is the one in Fig. 3(a), is the same process used for constructing auxiliary network during training?
-
Small correction: I think the size of u_{ij} in line 158 should be D_in^L x D_in^L instead of D_in^aux x D_out^aux, since u_{ij} is multiplied with the weights W_j^L of the j-th learngene layer, which is of size D_in^L x D_out^L.
Experimental results:
-
Some things are unclear to me:
- What is the difference between scratch-1 and scratch-2 trainings for one model?
- What is the teacher network used for distillation when training auxiliary network?
-
Comparison missing with highly relevant alternative strategies:
- Based on my understanding of the selection process, the final parameter matrices of the DesNet are effectively sub-matrices of the parameter matrices of the first L^des layers of the trained auxiliary network. Such a selection process can be applied to any pretrained transformer network. For example, one can train a standard ViT of the same size as the auxiliary network using distillation with the same teacher and then directly sample weights from this trained model to initialize Des-Nets. Initializing smaller transformer models from larger pretrained transformer models has been recently studied in [36].
- Another relevant approach, which is not mentioned in the paper, is "MatFormer: Nested Transformer for Elastic Inference".
-
Unfair comparison with from scratch training:
- The following statement in lines 259-261 is incorrect: "Compared to Scratch training, LeTra reduces around 3.6× total training costs (12×100 epochs vs. 320+12×1 epochs)." Comparing training costs simply in terms of number of epochs is not meaningful. LeTra trains a larger auxiliary model (H14-L15) and also uses distillation for training, which requires running a teacher model during training. So, 1 epoch of training the LeTra auxiliary model will be significantly costlier than 1 epoch of training any of the smaller Des-Net models.
- In terms of performance comparison, it is unfair to compare LeTra that uses distillation directly with training from scratch without distillation.
-
Experimental results are presented only on Des-Nets whose sizes lie between those of the learngene and the auxiliary network. How effective is the initialization for Des-Nets which are smaller than the learngene?
Questions
Some things are unclear from the paper. Please see the questions in 'weaknesses' section.
Comparisons with some highly relevant approaches are missing in the paper.
Limitations
No specific negative societal impact.
[Presentation]
1) Simplification of notations in Sec 3.1.
Thank you for pointing this out! In the revision, we will simplify the notations of Sec 3.1 to describe the method more clearly.
2) Initialization process in Figure 3.
Thank you for raising this confusion!
During the first training stage of LeTra, we construct the Aux-Net according to Eq.(1).
For the second initialization stage of LeTra, we first select certain rows/columns from the well-trained F_{l,in} to form the target matrices F_{l,in}^des. Subsequently, we use F_{l,in}^des to perform matrix multiplication with the learngene matrices. The multiplication results are then inserted into the original learngene matrices to derive intermediate matrices.
Similarly, we select certain rows/columns from the well-trained F_{l,out} to construct the target matrices F_{l,out}^des. Then we use F_{l,out}^des to multiply the intermediate matrices and insert the multiplication results into them to obtain the weight matrices of the Des-Net.
Thus, the first training stage and the second initialization stage of LeTra are not inherently interconnected. In the revision, we will add these descriptions in the caption of Figure 3 to avoid any misinterpretation.
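To make the initialization process above more concrete, here is a rough sketch of one plausible reading of it (all names are illustrative, and concatenation stands in for the insertion step, whose exact positions may differ):

```python
import torch

def widen_from_learngene(W_lg, F_in_sel, F_out_sel):
    """Sketch of the second-stage width transformation (illustrative only).

    W_lg:      learngene weight matrix, shape (d_in_lg, d_out_lg)
    F_in_sel:  rows selected from the trained F_{l,in}, shape (r_in, d_in_lg)
    F_out_sel: rows selected from the trained F_{l,out}, shape (r_out, d_out_lg)
    """
    # New input-dimension rows, placed alongside the original learngene rows.
    W_rows = torch.cat([W_lg, F_in_sel @ W_lg], dim=0)        # (d_in_lg + r_in, d_out_lg)
    # New output-dimension columns, placed alongside the original columns.
    W_des = torch.cat([W_rows, W_rows @ F_out_sel.T], dim=1)  # (..., d_out_lg + r_out)
    return W_des
```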
[Experimental results]
1) Difference between Scratch-1 and Scratch-2.
Scratch-1 involves training models from scratch with sizes identical to those used in LeTra (L544-545). Scratch-2 entails training models from scratch with sizes similar to those in Scratch-1 because these models only vary in depth (L546-547). Additional model details can be found in Table 5.
2) Teacher network.
We choose LeViT-384 [48] as the teacher/ancestry model (L536-537).
3) Comparison with IMwLM [36] and Matformer [i].
We employ DeiT-base distilled as the larger pretrained model for IMwLM [36]. For MatFormer [i], we refer to the results presented in Figure 4(a) of their original paper. From Figure I of our uploaded PDF, compared to IMwLM [36], which initializes smaller models from larger pretrained ones, and MatFormer [i], we observe that LeTra achieves superior performance. Notably, LeTra demonstrates the capability to initialize models whose sizes are independent of larger pretrained models.
[i] Kudugunta, Sneha, et al. "MatFormer: Nested Transformer for Elastic Inference." NeurIPS 2023.
4) Comparison with scratch training in training costs.
Thank you for pointing this out! In comparison to Scratch-1, we calculate GPU hours for training 12 Des-Nets and find that LeTra significantly reduces total training GPU hours by approximately 7× (1322 GPU hours for Scratch-1 versus 190 GPU hours for LeTra).
5) Performance of Des-Nets without distillation in the first stage.
Firstly, it is important to note that we did not employ distillation during the fine-tuning of Des-Nets in the second stage.
Secondly, we omit the distillation process in the first stage and present the performance of Des-Nets in Figure II of our uploaded PDF.
While the distillation process in the first stage marginally enhances the final performance of Des-Nets, the most substantial performance improvement arises from our proposed initialization process using LeTra.
6) Performance of Des-Nets which are smaller than learngene.
Please see the relevant discussions in G1 of General response.
I thank the authors for their rebuttal. After reading the rebuttal, I upgraded my rating to borderline accept. I still think that the paper needs significant changes in terms of writing to make several things clearer.
Dear Reviewer 8CMe,
We would like to express our sincere gratitude for your thoughtful and thorough comments.
We have made every effort to address the concerns raised. If any part of our response remains unclear or requires further clarification, please let us know.
To avoid unaffordable training costs, a new training paradigm, the Learngene framework, has been proposed. Unlike previous work that mainly focuses on the depth dimension, the authors propose Learnable Transformations, which can adjust the learngene module along both the depth and width dimensions for flexible variable-sized model initialization. Experimental results indicate that the proposed method achieves strong performance with only 1 epoch of fine-tuning.
Strengths
-
A new framework called LeTra is proposed, capable of transforming the learngene module along both depth and width.
-
Extensive experiments demonstrate that the proposed method achieves promising performance, even with just one epoch of tuning.
Weaknesses
As shown in Table 4, training from scratch consistently improves performance with increased depth, but almost no improvements are observed for the proposed method. Could the authors explain this in more detail?
How much does distillation contribute to the final performance? Is most of the performance gain from distillation during fine-tuning?
Could the authors clarify the results in Table 1? From my understanding, even when initializing a model from pre-training, performance would significantly drop without any fine-tuning.
Questions
Figure 1 (d) raises questions. It is commonly believed that depth is more crucial than width in deep networks. However, this work suggests that varying width can achieve better performance than increasing depth. Could the authors elaborate on this finding?
Limitations
See above.
1) Performance improvements with increased depth.
Thank you for pointing this out!
In Table 4, we present the results of LeTra with 2-epoch tuning (L315), which aims to demonstrate the effectiveness of our proposed depth transformation rather than performance improvements with increased depth.
Interestingly, during the early training phase (e.g., 2 epochs), model performance does not consistently improve with increased depth, whether using Scratch that trains randomly-initialized models or LeTra that trains learngene-initialized models. To validate this claim, we provide the results of Scratch with 5-epoch training and LeTra with 2-epoch tuning below.
Furthermore, extending the training of LeTra-initialized models to more epochs reveals a consistent improvement in model performance with increased depth.
| Model | Params(M) | FLOPs(G) | Scratch (5ep) | Scratch (100ep) | LeTra (2ep) | LeTra (20ep) |
|---|---|---|---|---|---|---|
| Des-H6-L13 | 23.8 | 5.0 | 10.4 | 76.4 | 78.1 | 79.3 |
| Des-H6-L14 | 25.6 | 5.3 | 10.7 | 76.5 | 78.0 | 79.5 |
| Des-H6-L15 | 27.4 | 5.7 | 10.0 | 77.1 | 78.2 | 79.8 |
2) Contribution of distillation to the final performance.
Thank you for your nice concern!
We remove the distillation process in the first stage and present the performance of LeTra in Figure II of our uploaded PDF.
Comparing "LeTra (without distillation in first stage) (5 epoch)" with "LeTra (with distillation in first stage) (5 epoch)", we observe that the distillation process in the first stage has a marginal improvement on the performance of Des-Nets.
Additionally, comparing "LeTra (without distillation in first stage) (5 epoch)" with "Scratch (5 epoch)", we find that LeTra's proposed initialization process significantly enhances Des-Nets' performance.
Consequently, we can safely conclude that the most substantial performance gains stem from our proposed initialization process using LeTra rather than the distillation process in the first stage.
Furthermore, it is important to note that we did not employ distillation during the fine-tuning of Des-Nets in the second stage.
3) Explanation of Table 1.
We acknowledge the necessity of retraining a specific task head when transferring well-trained parameters from task A to task B. However, in Table 1, both task A and task B for all baselines and LeTra involve ImageNet-1K. Therefore, we also inherit the classification head parameters from either the first stage or the pre-training stage to initialize the Des-Nets for ImageNet-1K.
4) Discussion about importance of depth and width for deep networks.
Thank you for your insightful question! While it is widely accepted that depth significantly impacts deep network design (as shown in the left table below, where model performance increases with depth), we empirically find that configuring width could also enhance model performance (as shown in the right table below). This discovery motivates us to explore transforming learngene across both the depth and width dimensions. Furthermore, recent studies (see [i]) emphasize the importance of simultaneously considering both width and depth dimensions in neural network design.
[i] "The shaped transformer: Attention models in the infinite depth-and-width limit." NeurIPS 2023.
| Model | Params(M) | FLOPs(G) | Scratch (100ep) | Model | Params(M) | FLOPs(G) | Scratch (100ep) |
|---|---|---|---|---|---|---|---|
| Des-H12-L7 | 51.0 | 10.3 | 76.5 | Des-H7-L12 | 29.9 | 6.2 | 76.4 |
| Des-H12-L8 | 58.2 | 11.7 | 77.2 | Des-H8-L12 | 38.8 | 8.0 | 77.7 |
| Des-H12-L9 | 65.3 | 13.1 | 78.0 | Des-H9-L12 | 49.0 | 10.0 | 77.8 |
| Des-H12-L10 | 72.4 | 14.6 | 78.2 | Des-H10-L12 | 60.3 | 12.3 | 79.0 |
| Des-H12-L11 | 79.5 | 16.0 | 79.0 | Des-H11-L12 | 72.9 | 14.8 | 78.4 |
| Des-H12-L12 | 86.6 | 17.5 | 79.6 | Des-H12-L12 | 86.6 | 17.5 | 79.6 |
Dear Reviewer i78Y,
We greatly appreciate the concerns provided and have made every effort to address all the points raised.
Is there any unclear point in our response that needs further clarification?
Dear Reviewer i78Y,
The author's rebuttal discussion is ending soon. Can you respond to the authors' reply and see if it addresses your concerns or if you would maintain your rating? Thank you for your time and effort.
Best,
Your AC
This research adopts the learngene learning paradigm; the core idea of learngene is to transform a well-trained ancestry model (Ans-Net) to initialize variable-sized descendant models (Des-Net). The authors pointed out two limitations of previous works: (1) the original learngene paradigm lacks the provision of structural knowledge which is not favorable for later transformation and (2) existing strategies to craft Des-Net are not learnable and they overlook the width dimension.
To address those limitations, the paper proposed LeTra, standing for Learnable Transformations:
-
Learnable transformation parameters with structural knowledge in Ans-Net that later facilitate descendant transformation. To this end, they train an auxiliary model (Aux-Net) to learn the transformations from the Ans-Net, where learnable structured matrices are responsible for the depth and width dimensions, respectively.
-
Given a target Des-Net of a different size, some columns/rows of the trained transformation matrices are selected to construct the corresponding transformation. The Ans-Net is then transformed using it to initialize the Des-Net.
The authors conducted extensive experiments, showcasing the advantages of LeTra, not only in terms of performance but also in terms of fine-tuning speed.
Strengths
-
The paper presentation is very clear and easy to understand. I enjoy reading it.
-
The motivation is clear and convincing. Extensive experiments validate the effectiveness of the proposed LeTra, showcasing the benefits of learnable transformation.
Weaknesses
-
The choice of Des-Nets is limited, as they can be seen as sub-nets of the pretrained Aux-Net, i.e., one cannot scale a Des-Net bigger than the Aux-Net.
-
Downstream experiments were only conducted on small classification datasets.
Questions
Those are related to the weaknesses above:
-
Is it possible to combine LeTra with other Learngene strategies to overcome the scaling limitation?
-
May the authors consider more complex downstream tasks like semantic segmentation or object detection?
-
Given the strong results of LeTra even without any fine-tuning (cf. Table 1), I'm curious about the linear probing results, i.e., fine-tuning only a simple head for the downstream task; that protocol is common in self-supervised learning.
Limitations
I believe this is a solid work with significant contributions suitable for the conference. However, I have a few questions and suggestions that I would greatly appreciate being addressed. My current recommendation is quite positive.
1) Scaling Des-Net bigger than Aux-Net.
Please see the relevant discussions in G1 of General response.
2) Combination with other Learngene strategies.
Thank you for your insightful question!
We could combine LeTra with other Learngene strategies. For instance, we could replace the depth transformation strategy of LeTra with the linear expansion strategy proposed by TLEG [23].
3) More complex downstream tasks.
Thank you for your nice concern!
In the revision, we are committed to expanding our experimental scope to include more diverse tasks and datasets.
4) Linear probing results.
As shown in the table below, we can observe that LeTra achieves better performance than Pre-Fin under the linear probing protocol. For example, LeTra outperforms Pre-Fin by 3.43% and 1.86% on CIFAR-100 and CIFAR-10 with Des-H12-L12.
| Des-H12-L12 | Pre-Fin | LeTra |
|---|---|---|
| CIFAR-100 | 72.08 | 75.51 |
| CIFAR-10 | 90.08 | 91.94 |
General response
G1. Size diversity of Des-Nets.
We appreciate the valuable comments regarding the size diversity of Des-Nets, e.g., scaling bigger than Aux-Net or smaller than learngene.
Firstly, we would like to emphasize that the primary focus of this paper is on initializing variable-sized models from learngene using well-trained transformations, rather than simply scaling model sizes beyond a certain threshold (e.g., Aux-Net).
Secondly, it is important to note that we can initialize models whose sizes are independent of both learngene and Aux-Net. In our empirical setups, the sizes of Des-Nets range from 29.9M to 109.8M parameters, encompassing that of Aux-Net (78.6M), as shown in Figure 4. Furthermore, we can initialize Des-Nets with fewer parameters than learngene, as well as Des-Nets deeper and wider than Aux-Net (H14-L15), utilizing the well-trained learngene and transformations:
-
For Des-Nets smaller than learngene, we directly select rows/columns from well-trained learngene matrices using our proposed selection strategies to initialize the target matrices of Des-Nets.
-
For Des-Nets deeper and wider than Aux-Net, we first select rows/columns from well-trained transformation matrices using our proposed strategies, then integrate these selections into the original transformation matrices to achieve the desired size. Subsequently, we employ these expanded matrices to transform learngene for Des-Net initialization.
As shown in Figure I of our uploaded PDF, LeTra could flexibly and efficiently initialize variable-sized models that are independent of the sizes of learngene and Aux-Net.
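For illustration, a rough sketch of the two cases above (hypothetical helpers; concatenation again stands in for the integration step, and step-wise/random selection are used only as examples):

```python
import torch

def shrink_below_learngene(W_lg: torch.Tensor, d_in: int, d_out: int) -> torch.Tensor:
    """Des-Net smaller than learngene: directly select rows/columns of the trained
    learngene matrix (step-wise selection shown for illustration)."""
    ridx = torch.arange(0, W_lg.shape[0], max(W_lg.shape[0] // d_in, 1))[:d_in]
    cidx = torch.arange(0, W_lg.shape[1], max(W_lg.shape[1] // d_out, 1))[:d_out]
    return W_lg[ridx][:, cidx]

def expand_beyond_aux(F_trained: torch.Tensor, extra_rows: int) -> torch.Tensor:
    """Des-Net wider than Aux-Net: select rows from the trained transformation matrix
    and integrate them back to reach the desired size."""
    idx = torch.randperm(F_trained.shape[0])[:extra_rows]
    return torch.cat([F_trained, F_trained[idx]], dim=0)
```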
The paper presents an improved learngene strategy that simultaneously trains a set of learnable transformations and compact learngene modules for subsequent model initialization.
The paper initially received mixed reviews. While the reviewers considered the method well-motivated (JzzD, p972) and novel (i78Y), and noted that it achieved good results on multiple datasets (JzzD, i78Y, 8CMe, p972), they also raised several major concerns about the paper: 1) lack of clarity in the presentation and method details (8CMe, i78Y, p972); 2) restrictions on the generated model sizes (JzzD, 8CMe, p972); 3) missing comparisons with related work, e.g., MatFormer (8CMe); 4) unfair or missing comparison settings (JzzD, i78Y, 8CMe, p972). The authors' rebuttal provided further clarifications, additional comparisons, and empirical results, which addressed most of the concerns from Reviewers JzzD, 8CMe, and p972, including 1)-4). During the post-rebuttal discussion, Reviewers 8CMe and p972 raised their ratings to the positive side, while Reviewer i78Y remained unconvinced by the experimental evaluation.
After reading the paper, the rebuttal, and the reviewers' comments, the AC tended to agree with the majority of the reviewers. While the paper still has room for improvement, such as lacking experiments on larger benchmarks, the rebuttal enhanced the experimental validation with clear evidence and explanations. Overall, given the additional results, the effectiveness and improvement of this work over the prior art are convincingly demonstrated, and therefore the AC recommends it for acceptance. The authors should take into account the rebuttal and the reviewers' feedback, especially regarding the presentation and the added results, when preparing the final version.