Features are fate: a theory of transfer learning in high-dimensional regression
Abstract
Reviews and Discussion
The paper studies an analytical model of transfer learning in deep linear networks. To begin, the authors prove that a piece of general wisdom, namely that the source and target domains must be close in distribution, is not a necessary condition for successful transfer learning. The authors study deep linear networks trained using gradient flow and establish a measure of transfer, the transferability, defined as the difference between the generalization error of a model trained from scratch and that of a model that has undergone transfer learning. They study transferability phase plots as a function of source-target alignment and the data-to-parameter ratio, both for linear transfer, where only the readout layer is trained on the target task, and for fine-tuning, where all the layers are trained on the target. Finally, the authors show empirically that their predictions hold for a two-layer ReLU network.
Strengths
The paper is very well written in general, and the result is novel and interesting. The model is able to capture transfer learning both when only the last layer is fine-tuned and when all the parameters are trained. The authors empirically validate their results on regression tasks.
Weaknesses
The paper is generally well written and understandable, and there aren’t really any major flaws that I could find. While one may consider the very simplified setting as a weakness, especially since in deep non-linear networks and Transformers the transfer learning behaviour might be different, I believe that such models would currently be too complicated to study. While I do not find any technical flaw with the paper, I do have several questions that I detailed below.
Questions
Can this framework also explain what happens in the case of gradient descent (rather than gradient flow)? As a follow up, what would be the effect of noise in stochastic gradient descent?
While I do understand and agree with Theorem 2.2, it seems to go against prior literature showing that these metrics correlate well with transfer. Would it be possible to empirically test the claim? More concretely, would it be possible to show an experiment where the source and target distributions are “far apart” with respect to the stated metrics, yet the transfer achieved is still positive?
Can these results also be extended to logistic regression in a classification setting?
Dear Reviewer n1uD,
We are happy to hear that you found the work to be readable and well-written. Thank you for your thoughtful questions; we address them below:
- We believe that this framework would also describe models trained with other optimization algorithms, such as gradient descent with a finite step size or SGD. The challenge lies in describing their dynamics precisely. For this reason, we focused on the analytically tractable setting of gradient flow. However, similar theorems on the global convergence of gradient descent with finite step size are proven in “Global Convergence of Gradient Descent for Deep Linear Residual Networks” (Wu et al., 2019). The main takeaway is that with a sufficiently small learning rate the network converges to a global optimum, in which case our results would hold exactly as written in the paper.
- We agree that Theorem 2.2 refutes prior work, which is why we find it so surprising! In Fig. 5 (Appendix E), we plot the KL divergence and the Wasserstein metric for the source and target distributions in our theory. If these metrics were predictive of transfer performance, the correlation would be negative, which is not the case. We have also performed similar experiments in 2-layer nonlinear networks. Here we construct two teacher networks with the same feature space but different output weights, so that the Wasserstein distance grows but transfer performance remains constant. We can include this plot in the appendix to address this concern.
- Since the focus of our paper is regression, the extension to classification is outside the scope of the current work. However, it is interesting to consider how the results may generalize. Consider a classification task defined by class-conditional distributions that are Gaussians with equal covariances $\Sigma$ and means $\mu_+$, $\mu_-$. Then the optimal classifier learns the discriminator vector $\Sigma^{-1}(\mu_+ - \mu_-)$. In this way, the problem is very closely related to the regression problem we consider. The features should be defined by the two vectors $\mu_+^s - \mu_-^s$ and $\mu_+^t - \mu_-^t$, which define the separation between the means of the Gaussians in the source and target task respectively. Then the analog of the feature overlap is the alignment between these two vectors, as in our current work. We predict that, as in the paper, this overlap will control the transferability. This could be the subject of future work and we thank you for pointing out the extension.
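For concreteness, a minimal numerical sketch of this classification analogy (the notation $\Sigma$, $\mu_\pm$, and the cosine definition of the overlap are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10
Sigma = np.eye(D)  # shared class covariance (taken isotropic here for simplicity)

# class means for the source and target tasks (arbitrary illustrative values)
mu_plus_s, mu_minus_s = rng.normal(size=D), rng.normal(size=D)
mu_plus_t, mu_minus_t = rng.normal(size=D), rng.normal(size=D)

# Bayes-optimal (LDA) discriminator directions for each task
w_s = np.linalg.solve(Sigma, mu_plus_s - mu_minus_s)
w_t = np.linalg.solve(Sigma, mu_plus_t - mu_minus_t)

# proposed analog of the feature overlap: alignment of the two discriminators
overlap = abs(w_s @ w_t) / (np.linalg.norm(w_s) * np.linalg.norm(w_t))
print("source/target overlap (cosine):", round(float(overlap), 3))
```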
I thank the authors for their reply. I am still a bit confused about Theorem 2.2 and am trying to understand it. Please correct me if I'm wrong; my current understanding is that the authors are claiming the following.
For any $f_s$ living in some space (i.e., it can be written as a linear combination of some orthonormal basis functions $\phi_i$), you can always find an $f_t$ living in the same space (also expressible as a linear combination of the same basis functions) such that the distance between the distributions generated by $f_s$ and $f_t$ is large. So far I can see this result, and it seems intuitively true (just pick completely different coefficients). However, I cannot immediately see the connection to the transfer learning notion.
If a neural network is able to learn $f_s$ from the source data (in this case generated by $f_s$), which I guess intuitively amounts to the network being able to "learn the coefficients" of $f_s$ (very loosely speaking), why should the network perform well on $f_t$, given that it has learned an approximate representation for $f_s$ and $f_t$ is far from $f_s$ (in Dudley metric or KL divergence)?
Dear Reviewer n1uD,
Thank you for your insightful question. To clarify how Theorem 2.2 connects to transfer learning, consider a pretrained model that has been trained on the dataset defined by the function $f_s$. If we want a predictive metric for transfer learning based on the datasets themselves, it should correlate negatively with transfer performance, i.e., transfer should be better for datasets that are closer under the metric. Assume that pretraining is successful, in the sense that the trained model learns a representation of the function $f_s$. Under the mild architectural assumption that the model contains a final linear layer mapping the last hidden representation to the output, it must be true that the space spanned by the features of $f_s$ is contained in the space spanned by the activations of the model before the final readout layer. In fact, we prove in Theorem 3.4 that for linear networks this is the case and that the space spanned by the final hidden layer activations is equal to the space spanned by the pretraining task function. Now consider the entire space of downstream tasks; in our framework this corresponds to the set of tasks defined by an arbitrary target function $f_t$. A subset of tasks for which transfer learning will be successful are those that live in the subspace spanned by the $\phi_i$, since the network only needs to update its final layer weights in order to represent $f_t$. One can make a Rademacher complexity argument here that, with finite data, learning the output weights on top of the correct features will be more successful than learning new features and output weights. Therefore, the subspace of tasks that we consider in Theorem 2.2 will transfer well. But we prove in the theorem that one can always find a task in this subspace for which these dataset-based metrics are arbitrarily large. Therefore, it is always possible, mathematically, to find a counterexample to the notion that distant distributions should not transfer well. We hope that this explanation clarifies any confusion.
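A small numpy sketch can illustrate this scenario (an illustration with made-up dimensions and scales, not code from the paper): the target function lies in the span of the same features as the source function but with very different coefficients, so a dataset-based distance such as the KL divergence between the label marginals is large, yet refitting only the readout on the fixed feature subspace recovers the target function from modest data.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, n_target = 200, 5, 40      # ambient dimension, feature-subspace dimension, target samples

# orthonormal "features" phi_1..phi_M (here: fixed random directions in R^D)
Phi = np.linalg.qr(rng.normal(size=(D, M)))[0]            # D x M
beta_s = Phi @ rng.normal(size=M)                         # source function f_s(x) = beta_s . x
beta_t = Phi @ (50.0 * rng.normal(size=M))                # target in the SAME subspace,
                                                          # with very different coefficients

# crude dataset-based distance: KL between the zero-mean Gaussian label marginals
sig_s, sig_t = np.linalg.norm(beta_s), np.linalg.norm(beta_t)
kl = np.log(sig_t / sig_s) + (sig_s**2 - sig_t**2) / (2 * sig_t**2)
print("KL(p(y_s) || p(y_t)) ~", round(float(kl), 2))      # grows without bound with the scale

# "linear transfer": keep the feature map Phi fixed and refit only the readout weights
X = rng.normal(size=(n_target, D))
y = X @ beta_t + 0.1 * rng.normal(size=n_target)
w, *_ = np.linalg.lstsq(X @ Phi, y, rcond=None)           # n_target >> M, so this is well-posed
X_test = rng.normal(size=(2000, D))
test_err = np.mean((X_test @ Phi @ w - X_test @ beta_t) ** 2)
print("linear-transfer test error:", round(float(test_err), 4))  # small despite the large KL
```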
This paper develops a theoretical framework to analyze transferability in high-dimensional regression tasks within the context of transfer learning. The authors adopt a feature-based approach, proposing that the similarity between source and target tasks is best understood through feature space overlap rather than distributional similarity. The paper establishes phase diagrams that quantify the conditions under which transfer learning outperforms training from scratch, particularly when feature space alignment is high and target data is limited. The analysis primarily focuses on deep linear networks but extends some insights to nonlinear networks.
Strengths
This paper addresses the essential and meaningful topic of understanding transferability in transfer learning. The problem setup, including assumptions, is clearly and concisely presented.
Weaknesses
The main theoretical results depend on assumptions like normally distributed inputs and linear source and target functions, which limit applicability. Nonetheless, those simplifications are reasonable for an initial exploration of this topic.
Questions
-
Equations (1) and (2) use the same noise term $\epsilon$. Does this imply that the source and target outputs for a given input share the same noise component?
-
In Equation (6), could the authors clarify the interpretation of the quantity appearing there?
-
In line 174, what is the meaning of the symbol used there?
-
Theorems 3.4 and 3.5 are referenced from Yun et al. What is the motivation for re-proving these theorems, and why are they prominently highlighted in the main text?
Dear Reviewer vks9,
Thank you for your review of our paper. We agree that normally distributed data and linear source and target functions are a major simplification of realistic transfer learning scenarios. However, we chose this simplified setting in order to have an analytically solvable model. We note, however, that the results on gradient flow dynamics still hold for Gaussian data with arbitrary covariance, and the generalization error in this setting is the subject of existing work (for example, “A Theory of High Dimensional Regression with Arbitrary Correlations”, Mel & Ganguli, 2021), so this assumption can likely be relaxed. We also explore nonlinear source and target functions in Section 4, where we discuss student-teacher ReLU networks. Finally, we direct you to the paper “Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution” (Kumar et al., 2022), which demonstrates empirically the importance of the learned feature space for out-of-distribution generalization in a number of realistic settings. Additionally, we answer your questions below:
- Thank you for pointing this out: the notation is not clear. Equations 1 and 2 should be written with distinct noise variables $\epsilon_s$ and $\epsilon_t$ to make it clear that these are independent noise realizations for the source and target tasks. In our proof we consider the label noise to be drawn independently from the same Gaussian distribution in the source and target task, but the extension to Gaussian distributions with different variances is straightforward and our proofs will still hold with only minor adjustments.
- In Equation 6, this symbol denotes an individual parameter within the full set of model parameters.
- Thank you for pointing this out, it is a typo. The subscript should follow the norm symbol, denoting the Lipschitz norm of the function.
- Although these theorems are simply extensions of those found in Yun et al., they are essential components of our argument. We felt that citing them as important building blocks and reproducing the proofs in the main text makes our paper accessible and self-contained for readers who may not be familiar with the literature on deep linear networks. In addition, we extend their results on the convergence of gradient descent to underparameterized networks in Appendix B.4.
Thank you for the response. I keep my original rating.
The paper theoretically studies transfer learning for deep linear networks in the feature-learning "rich" regime, in the joint scaling limit where the number of fine-tuning samples and the dimension (including the width) are large. The pretraining is performed on the population loss to simulate the fact that the source dataset has to be larger than the fine-tuning one.
The authors first argue that model-agnostic task-similarity metrics (based solely on the source and target distributions) are insufficient to capture how good the transfer is, and they find counterexamples where this is the case. Then, they propose a new metric based on the discrepancy between the performance of the fine-tuned model and that of the trained-from-scratch model.
Given this metric, they compute the network performance in the asymptotic limit described above, both in the case of linear fine-tuning and of "full" fine-tuning of all the parameters.
Strengths
-
The results presented, especially Theorems 3.7, 3.8, and 3.9, are, to my knowledge, novel and relevant to the development of the theory of transfer learning in neural networks. Although obtained in the linear setting, these results are non-trivial since the authors operate in the rich regime, which corresponds to a non-convex optimization problem (all the weight matrices move away from initialization in the limit).
-
The paper is overall very well-written. The results are nicely presented, and the storyline is logically flawless. I especially appreciated how results from existing work are adequately presented, such that the novel theorems feel tied into the existing literature. Finally, the results are adequately discussed and put into context (e.g. Theorem 3.6).
Weaknesses
The main fundamental weaknesses of this paper in my opinion lie in how certain results are portrayed, which can be quite misleading. More concretely:
-
First of all, I think studying linear networks in the specific context of transfer learning is an inherent limitation that should be discussed more clearly. This is because the function class is linear no matter how many hidden layers there are. In fact, one could consider a simple linear regression model (for which the asymptotic analysis is known, as the authors correctly cite), and then perform linear evaluation on fresh new weights. This is how I interpret Theorems 3.5 and 3.6, where there is no dependence on the depth.
-
I find it a bit misleading to deem the dataset-based discrepancy metrics misleading on the basis of the "anomalous positive transfer". In this case, as the authors state, the positive transferability comes purely from the double descent phenomenon and not from any inherently positive feature of fine-tuning. In a sense, the benefits of pretraining come purely from the additional datapoints the model is pre-trained on, which let it avoid double descent. In this sense, I feel that the message stated in the abstract, namely that model-agnostic similarity metrics are insufficient while the proposed feature-centric metric is portrayed as the solution, is misleading: the proposed metric has precisely the flaw that it does not take the double descent phenomenon into account. To give an example, when the source and target distributions are exactly the same, the metric would still present this anomaly. Thus, comparing with the trained-from-scratch model is not enough to evaluate transferability.
-
The results in Section 3.8 are also incomplete. How does the measure of transferability change? I appreciate the result that in general, the transfer is worse (and this is a nice result per se), but it would be nice to tie it to the previous section, i.e. comparing it to the trained-from-scratch model. I would imagine that the picture in Figure 1 (b) would be different, as regularization makes training from scratch in general a bit better by avoiding double descent. Again, in this case, I would imagine that the dataset-based discrepancy metrics correlate with transferability. More generally, I wonder how much of these conclusions would hold excluding the double descent phenomenon, or if what we are observing is entirely due to it. If this is the case, then advocating for a feature-centric view of transfer learning is misleading.
-
Figure C requires better formatting.
Minor issues/questions:
-
What is M, line 161?
-
Dudley Metric not introduced
-
Assumption (22) should be explained in the main text, or at least referenced because initialization is not really discussed there.
Questions
See weaknesses.
- Thank you for pointing this out. You are correct that in the regime of "anomalous positive transfer" the benefit of pre-training comes solely from having more data in the source task; we will make this point clear in our revised manuscript. While we do make the point that simple data-dependent metrics do not predict transfer, we do not put forward measuring feature-space overlap directly as an "alternative" metric, but rather as a theoretical guidepost for future algorithm development. Although we believe that this quantity (defined as the norm of the projection of the target function onto the orthogonal complement of the pretrained RKHS) would correlate with transfer performance, it is impossible to calculate before pretraining, when the ground truth target function is not known. However, one can estimate this quantity given a pretrained model and samples of the data. For the two models we consider, this quantity can be computed exactly given the source and target functions (see Appendix C for the procedure in the 2-layer ReLU network). The main idea we are trying to advance is that similarity between tasks should be defined with respect to the features of the model, not with respect to the datasets themselves. We agree that the double descent phenomenon obscures this point a bit. Since the double descent occurs in the scratch-trained model, it could indeed be avoided by regularization. Whether the appropriate scratch-trained model to compare to should include regularization or not is an interesting philosophical question. In our work we adopt the perspective that the correct model to compare against is one trained using the same algorithm as in pretraining. In our setting that is unregularized gradient flow, which for deep linear networks converges to a representation of the minimum-norm interpolator, hence the double descent. Our theoretical techniques for describing gradient flow rely on the absence of weight decay (the conserved quantity in Eq. 35 is no longer conserved if we add weight decay), so we do not include it, in order to make rigorous statements about the dynamics. However, in our revised draft we will include a numerical simulation demonstrating that anomalous positive transfer is eliminated if the scratch network is optimally regularized.
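To make the estimation idea concrete, one possible sketch (an illustration only; `pretrained_features`, `X_target`, and `y_target` are hypothetical placeholders, and label noise will inflate the estimate):

```python
import numpy as np

def orthogonal_mismatch(H, y, ridge=1e-6):
    """Estimate the fraction of the target function's energy lying outside the
    span of the pretrained features.

    H : (n, k) hidden-layer activations of the pretrained model on target inputs
    y : (n,)   target labels
    Returns residual energy / total energy in [0, 1]; values near 0 suggest the
    target task lies (almost) entirely inside the pretrained feature space.
    Label noise contributes to the residual, so this is an upper estimate.
    """
    w = np.linalg.solve(H.T @ H + ridge * np.eye(H.shape[1]), H.T @ y)
    resid = y - H @ w
    return float(resid @ resid) / float(y @ y)

# usage sketch (placeholder names):
#   H = pretrained_features(X_target)   # activations before the readout layer
#   print(orthogonal_mismatch(H, y_target))
```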
Dear Reviewer YzFe,
Thank you for your thoughtful review of our paper. We are very happy to hear that you felt our manuscript was well written and “logically flawless”. In addition, we thank you for pointing out sections of the work that felt misleading. We address these points in the following comments:
- We agree that deep linear networks are a particularly simple setting, but we believe that much of the phenomenology we rigorously demonstrate in these networks still holds for arbitrary architectures. The overall picture is the following. During pretraining, the network will learn a representation of the ground truth function $f_s$ that generates the labels. We can think of this function as a linear combination of basis functions, since it is an element of a vector space of functions. We argue that training in the feature learning regime causes the network to learn these (and only these) basis functions to build a representation of $f_s$. We demonstrate that this is the case analytically for deep linear networks in Theorem 3.4, which shows that after training, the features of the model (the activations at the final hidden layer) only span the one-dimensional subspace along the source task direction. This should be compared with training in the lazy regime, where the function space of the model is fixed at initialization, and the best the model can do is learn the projection of the ground truth function into the RKHS defined by the NTK. In the deep linear case, this would correspond to fixing the weight matrices to their initial random values and performing regression using features that are essentially a product of random matrices. The fact that the network only represents the features relevant to $f_s$ is the source of negative transfer for downstream tasks. During linear transfer, when only the output weights can be tuned on the target task, the best the network can do is learn the component of the target function that lies along the source direction, since its features are fixed to their pretrained values (see Theorem 3.7). You mention that we could instead consider a simple linear regression model and "perform linear evaluation on fresh new weights". We would greatly appreciate it if you could clarify the experiment you have in mind so that we can effectively address the concern. For arbitrary nonlinear networks, we instead need to think in function space rather than in a finite-dimensional vector space. The relevant function space is the RKHS defined by the activations at the final hidden layer after pretraining. We demonstrate that in 2-layer ReLU networks the pretrained network encodes a basis for this space in Section 4 (see Fig. 4a-b). We believe that in deeper nonlinear networks the picture is the same, but identifying the features becomes more complicated. For this reason, we focus on deep linear networks in the student-teacher setting, as the features are readily identifiable and we can prove rigorously that our feature learning hypothesis holds. To address this confusion, we plan to include an "Our Contributions" section in the introduction, where we describe this qualitative picture more clearly. Your observation that depth plays no role in generalization is important and we will comment on it in the main text. In our setting depth does not play a role since the ground truth function is linear, as you point out. In nonlinear networks with more complicated ground truth functions, depth will modify which features are learnable, but will not fundamentally change the argument. Whatever features are learned define an RKHS which constrains how well the target function can be regressed. Finally, we will mention that in the full fine-tuning case the situation is more complicated. We are able to theoretically describe this setting in deep linear networks since these models enjoy a number of favorable analytical properties (see Eq. 35) that allow us to describe gradient flow from an arbitrary initial condition.
This kind of analysis is notoriously difficult for an arbitrary network. We are unaware of existing theoretical techniques to describe optimization dynamics in deep nonlinear networks from non-random initializations.
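For readers who would like to check the rank-collapse claim numerically, a minimal simulation sketch in the spirit of Theorem 3.4 (an illustration only; hyperparameters are arbitrary and may need tuning, and this is not the paper's experiment):

```python
import torch

torch.manual_seed(0)
d, h, n = 30, 64, 4000
beta_s = torch.randn(d); beta_s /= beta_s.norm()
X = torch.randn(n, d)
y = X @ beta_s                                    # linear source task

# three-layer linear network with small (rich-regime) initialization
scale = 0.1
W1 = (scale * torch.randn(h, d) / d**0.5).requires_grad_(True)
W2 = (scale * torch.randn(h, h) / h**0.5).requires_grad_(True)
a = (scale * torch.randn(h) / h**0.5).requires_grad_(True)

opt = torch.optim.SGD([W1, W2, a], lr=0.2)
for _ in range(20000):                            # plain full-batch gradient descent
    pred = (X @ W1.T @ W2.T) @ a
    loss = torch.mean((pred - y) ** 2)
    opt.zero_grad(); loss.backward(); opt.step()

# the map x -> W2 W1 x gives the final hidden-layer activations:
# which input directions survive after training?
A = (W2 @ W1).detach()
U, S, Vh = torch.linalg.svd(A)
print("singular value ratio S1/S2:", (S[0] / S[1]).item())               # well above 1: near rank-1
print("|<top input direction, beta_s>|:", (Vh[0] @ beta_s).abs().item()) # close to 1: aligned with the task
```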
Minor issues/questions
- M is the dimensionality of the subspace spanned by the basis functions. It is defined through the notation $\{\phi_i\}_{i=1}^M$ in the statement of Assumption 2.1.
- The Dudley metric is an Integral Probability Metric, as mentioned in the text preceding the statement of the Theorem (line 150) and defined precisely in Appendix B Eq. 23
- We will add a few lines about Equation 22 in the main text for clarity. This is mostly a technical assumption. It is the most general initialization we could find in the literature for which our results on gradient flow dynamics hold. We note that it holds with high probability for He initialization, making it a valid assumption in realistic settings.
I sincerely thank the authors for their extensive efforts in the response. I also acknowledge the other Reviewers' responses.
- I agree that the training dynamics in the feature learning regime are very complicated for deep linear nets, especially compared to the linearized (NTK) regime. My comment on the limitations of this work is that when it comes to transfer learning, the interesting aspects of feature learning come from learning nonlinear features. In a sense, studying a two-layer nonlinear network is more insightful than understanding linear networks of arbitrary depth, as is apparent in Sections 3.1 and 3.2, where the depth of the model does not enter the results. In this sense, it is my understanding that studying a two-layer linear network would have provided the same information. However, I agree that this is more complicated for a theoretical study (as pointed out by the authors in Section 4), but I am still of the same opinion on the inherent limitation of linear networks in a transfer learning context. This is also what I meant regarding the thought experiment I suggested (i.e. would a single hidden-layer linear network have provided the same results?). I realized it was not correctly phrased, and I apologize for that (I am not asking to run additional experiments). I really appreciate that the authors will include a discussion on the role of depth.
2-3. I sincerely appreciate that the authors will make this point clearer in future versions of the manuscript. And I also thank the authors for the clarifications. In the submitted version, in my opinion, the paper lacks a discussion on the role of data-centric metrics vs the proposed feature-centric metrics. Actually, I would really focus more on the nice theoretical results of Section 3, and a bit less on the dispute between feature-centric and dataset-centric metrics. I acknowledge that overcoming the problem of double descent in the theoretical framework is complicated due to the technical difficulty of studying the dynamics in a regularized setting.
- We would like to emphasize that the regularization we consider in this section applies to the transfer algorithm, not to the scratch-trained model. Equation 15 describes the generalization error of transfer learning using ridge-regularized tuning of the final layer weights. Therefore, the expression for the transferability is simply Eq. 10 minus Eq. 15; this corresponds to replacing the corresponding term in Eq. 13 with Eq. 15. Since we show that Eq. 15 is monotonically increasing in the ridge parameter, the transferability is strictly worse (smaller) for any nonzero regularization. Since the scratch-training error is the same in this setting, the double descent phenomenon is still present. Since this concern is related to Weakness 2, we feel that it will be adequately addressed by the numerical experiment proposed in our response. With that being said, we will adjust the wording of Section 3.3.1 to emphasize that the regularization we consider is on the transferred model, not on scratch training.
This paper addresses the gap in understanding how to adapt large pre-trained models to data-limited tasks by examining the theory behind task similarity in transfer learning. It challenges the conventional belief that similarity between source and target data distributions, such as through ϕ-divergences, predicts transfer success. Instead, it suggests a feature-centric approach where transfer learning is effective when the target task aligns well with the feature space of the pre-trained model. The authors demonstrate their theory using deep linear networks, providing insights into when transfer or fine-tuning outperforms training from scratch, especially with limited data.
Strengths
1- Even though the structure of the paper is atypical (e.g., a short introduction with related work as a subsection), the paper is well-written and easy to follow.
2- Transfer learning is a very important topic in DL, and understanding when/why it works is crucial for the overall understanding of deep learning.
3- I did not read all the proofs in extreme detail. I only checked the skeleton of the proofs; they seem correct and the results are reasonable. However, I am not up to date with the current advancements in transfer learning theory, so I cannot judge the novelty of the proofs/theoretical results, hence the low confidence in the score.
Weaknesses
1- Lack of empirical results: the authors only consider a small example of a two-layer neural network, so it is hard to see whether the conclusions/insights of their theory really translate to deep state-of-the-art models.
2- The proposed theory does not offer any insight into the design of representation learning approaches and why some work better than others (e.g., contrastive learning with self-supervised learning), which in my opinion is a very interesting question crucial to understanding transfer learning.
Questions
Q1- Does the proposed theory shed some light on why some approaches (e.g., contrastive learning with self-supervised learning) learn better and more transferable representations than others?
Q2- Is it possible to use the proposed theory to improve the transfer learning capabilities of models? For example, during the representation learning phase with the source data, one could add a regularizer that tries to maximize the overlap in feature space using a small amount of target data. I would like to hear the authors' thoughts on this and on other ways to leverage this theory in practice.
Dear Reviewer NGae,
Thank you for your detailed review of our paper. We are happy to hear that you found it “well-written and easy to follow”. We also thank you for your feedback on its weaknesses, which we address below.
Weaknesses:
- While we agree that empirical verification in deeper networks would be compelling, the focus of our paper is primarily theoretical, so we have chosen to use the allotted pages to clearly describe the theory and its implications. The goal of our theory is to explain the origin of a phenomenon already observed in large scale deep neural networks, as demonstrated empirically in:
- “Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution” (Kumar et al., 2022). This paper demonstrates a failure mode of fine-tuning related to feature overlap across a number of benchmark distribution shift experiments.
- “What is being transferred in transfer learning?” (Neyshabur et al., 2020). This paper shows empirically that feature similarity drives positive transfer in a number of benchmark tasks. Thus, while we appreciate your request to carry out experiments in deeper networks, this has partially already been done in prior experimental work (though without any theoretical explanation), and we feel it is a strength of our theory that it quantitatively explains what is happening in linear networks and provides a good description of the two-layer nonlinear network experiments in our paper, as well as of the deeper-network experiments in prior work.
- While self-supervised learning is beyond the scope of this work, we believe our theoretical results have implications for this setting as well. Consider a self-supervised technique such as SimCLR. The objective of the network is to collapse augmented pairs toward similar network representations and pull different examples away from each other in representation space. Such an unsupervised representation learning algorithm will preserve features that are important for a downstream classification task if and only if data augmentations transform data points from one downstream task class into data points that are recognizably from the same class. Thus the data augmentation should scramble features that are irrelevant for determining the downstream task's class labels, while preserving features that are important for determining them. If it does so, then features learned through SimCLR should transfer positively to the downstream task. So the logic is the same as in our paper: if the pretraining task (whether an upstream classification task or SimCLR representation learning) learns features that are useful for computing class labels in the downstream classification task, then, and only then, will positive transfer occur. To see this explicitly in a toy example, consider a downstream classification task defined by two class-conditional distributions, each a two-dimensional Gaussian with isotropic covariance, with means $\mu_1$ and $\mu_2$. The optimal binary classifier is defined by the vector $\mu_1 - \mu_2$. We can consider SimCLR data augmentations that create augmented pairs from a single example by adding a small perturbation in the direction orthogonal to $\mu_1 - \mu_2$. This would collapse irrelevant dimensions for the downstream task and therefore lead to positive transfer on the downstream classification task. On the other hand, if the data augmentation consisted of adding noise in the direction of $\mu_1 - \mu_2$, SimCLR would then collapse a feature dimension important for the downstream task, leading to negative transfer. This is the analog of feature mismatch in our theory.
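The toy picture can be mimicked numerically without implementing SimCLR itself, by directly collapsing (projecting out) the feature dimension that the augmentation scrambles and checking whether the downstream classes remain separable. A minimal sketch with assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])   # downstream class means (toy values)
w = mu1 - mu2                                             # optimal discriminator direction
X = np.vstack([rng.normal(mu1, 1.0, size=(500, 2)),
               rng.normal(mu2, 1.0, size=(500, 2))])
labels = np.array([1] * 500 + [0] * 500)

def collapse(X, direction):
    """Project out `direction`, mimicking a representation in which that
    feature dimension has been collapsed by the augmentation."""
    d = direction / np.linalg.norm(direction)
    return X - np.outer(X @ d, d)

for name, direction in [("augmentation orthogonal to w (collapses an irrelevant dimension)", np.array([0.0, 1.0])),
                        ("augmentation along w (collapses the relevant dimension)", w)]:
    Z = collapse(X, direction)
    w_hat = Z[labels == 1].mean(0) - Z[labels == 0].mean(0)   # mean-difference classifier
    acc = float(((Z @ w_hat > 0).astype(int) == labels).mean())
    print(f"{name}: downstream accuracy ~ {acc:.2f}")          # roughly 0.84 vs 0.5
```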
Questions:
- Addressed above.
- We do believe that our theory is actionable, since it highlights the potential pitfalls of pretraining in the feature learning regime. That is, learning the source function too well may actually hinder transfer performance if the target task does not lie in the same feature space as the source function. This suggests a number of regularization approaches that would lead to more flexible pretrained models, which we plan to discuss in the conclusion of our revised draft. One such method is to generate a sweep of pretraining runs that span the axis from lazy learning to feature learning. This is done by adding a regularization term $\gamma \lVert \theta - \theta_0 \rVert^2$ to the pretraining loss, where $\theta$ is the weight vector and $\theta_0$ is its value at initialization. When $\gamma$ is very large the weights do not move and the model trains in the lazy regime. We can then transfer models with varying $\gamma$ to the target task and choose the one that minimizes the generalization error on the target task. We will include the results of this experiment in the appendix of our revised draft, but to summarize, choosing the optimally regularized pretrained network eliminates negative transfer.
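A sketch of the regularized pretraining objective described above (an illustration under the stated assumptions; the function name, `base_model`, and the hyperparameters are placeholders):

```python
import torch

def pretrain_with_pullback(model, X, y, gamma, steps=5000, lr=1e-2):
    """Pretrain with loss = MSE + gamma * ||theta - theta_0||^2.

    gamma -> 0 recovers ordinary (feature-learning) pretraining; a large gamma
    pins the weights near their initialization, i.e. the lazy regime.
    """
    theta0 = [p.detach().clone() for p in model.parameters()]
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        mse = torch.mean((model(X).squeeze(-1) - y) ** 2)
        reg = sum(((p - p0) ** 2).sum() for p, p0 in zip(model.parameters(), theta0))
        loss = mse + gamma * reg
        opt.zero_grad(); loss.backward(); opt.step()
    return model

# usage sketch: sweep gamma, transfer each pretrained copy to the target task,
# and keep the value of gamma with the lowest target-task validation error, e.g.
#   for gamma in [0.0, 1e-3, 1e-2, 1e-1, 1.0, 10.0]:
#       m = pretrain_with_pullback(copy.deepcopy(base_model), X_src, y_src, gamma)
#       ... run the transfer procedure on the target task and record its error ...
```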
I thank the authors for their answers, especially the self-supervised learning example.
1- I agree with Reviewer YzFe that, from a transfer learning perspective, theoretical analysis based solely on linear models is very limited because "when it comes to transfer learning, the interesting aspects of feature learning come from learning nonlinear features." This is why I advocated for adding some experiments, where you could perhaps show that, even though the results are derived for linear models, they hold 'true' in practice for non-linear models. This would add more value to the results of this paper and show that they can be insightful beyond the linear regime.
2- Another point: from a theoretical perspective, [1,2] carry out a more fine-grained analysis of transfer learning (with results holding for both linear and non-linear models). However, the paper does not mention or discuss these works.
[1] Aminian, Gholamali, Samuel N. Cohen, and Łukasz Szpruch. "Understanding Transfer Learning via Mean-field Analysis." 2024.
[2] Bu, Yuheng, et al. "Characterizing and Understanding the Generalization Error of Transfer Learning with Gibbs Algorithm." International Conference on Artificial Intelligence and Statistics, PMLR, 2022.
Dear Reviewer NGae,
Thank you for your additional comments, and for directing us to these two relevant works— they have been cited in our revised draft. We address your first point in Section 4 of our paper, where we study two layer nonlinear networks and describe a generalized framework for analysis that qualitatively agrees with our findings in deep linear networks.
This paper aims to provide new theoretical insights into fine-tuning and transfer learning with large-scale foundation models. In particular, the authors aim to improve the theoretical basis for measuring the similarity between two machine learning tasks: the baseline task for which the foundation model was trained and the target task for which we are fine-tuning/transferring. The authors argue that one should study this problem from the standpoint of data features. The authors claim to make progress and relate their results to feature learning dynamics.
Strengths
The authors are correct that a theory of fine-tuning foundation models is lacking, and so work in this space is welcome. Indeed an explanatory theory for transfer/fine-tuning of nonlinear, deep networks would be a significant advance in the field.
The feature-centric viewpoint appears novel.
Weaknesses
I found it difficult to assess the contributions of the paper because of unclear explanations. In particular, Theorem 2.2 is crucial because it is claimed to motivate the work ("General Theoretical Setting"), but I cannot follow the argument. What I can best infer from the text is that the authors are saying the following: if the source and target distributions are the same (or similar) but the features are different, then transferring will fail. This means that, even if the data is the same for both source and target, if a network f is the pretrained one, then we can find a "bad" network g that has worse performance. But this seems obvious, since one can pick a random network g that will, with overwhelmingly high probability, have worse performance than any trained network. My suspicion from this is that the authors' results won't actually hold up in practice, but it is hard to tell because the setup is so unclear.
The claimed theory (which I could not exactly follow due to the unclarity above) is proved only for deep linear networks and shallow (e.g., 2-layer) nonlinear networks, which significantly reduces its explanatory potential in realistic deep learning settings.
Questions
If I have a misconception regarding Theorem 2.2, can you please clarify your explanation? The entire paper revolves around this result.
Thank you for your discussion around Theorem 2.2. While reading your rebuttal plus re-reading the paper, I have a few additional concerns that should be discussed.
-
You initially say that x and y are jointly distributed and that the labels are real numbers y in R (second sentence in Section 2). This is fine although a slight abuse of notation since random variables are not numbers. Then you define noisy versions of y and call them ys and yt. They are essentially noisy transformations of x. So far that is fine. Now in Assumption 2.1, you say that x is a data point and that y is a function on x that derives its statistical properties from f(x) + e. This is incorrect, because, if y is a function and not a random variable, then it cannot have statistical properties. Essentially, you are simultaneously saying that y=f(x)+e AND that y is a function of x (ie y(x) =f(x)+e), which doesn't make sense. Additionally, if x is a point, then y is just Gaussian since e is Gaussian. So where did the randomness in x go? Also are you only considering Gaussian data? Actually, it seems to be univariate Gaussian, and so the most basic case possible.
-
Your explanation is helpful but not complete. For example, let's say that we train the source model and get f which is essentially a wavelet transform. Then the basis functions phi's are just some wavelet basis in L^2. Then the subspace is the entire space. Then you are saying that you can find another function using the wavelet basis that gives poor performance. Obviously that is true -- just pick random wavelet coefficients. And if you want larger error you can just scale the coefficient weights.
-
Point 2 seems especially true, since you do not say that f and g are similar. You just say that you can find a g in the same subspace that f is in that is different.
-
If x has a fixed distribution for source and target tasks but fs and ft are different then the joint distributions p(x,ys) and p(x,yt) will be different. So I am not sure how you can claim in your rebuttal that "it is impossible to build two tasks with identical distribution for which transferring is poor". It must be that you are defining these functions, somehow, by learning them. But it is not clear how. Maybe you are assuming that the networks converge to a global minima?
Dear Reviewer Ct4H,
We thank you for your review and apologize for any confusion stemming from the presentation of Theorem 2.2. In words, Theorem 2.2 says the following: consider a task with labels defined by a function $f_s$ as in Eq. 1. This function induces a joint distribution on the input/label pairs $(x, y_s)$. Since $f_s$ lives in the span of the basis functions $\{\phi_i\}$ by the assumption in the Theorem, we can express it as a linear combination of those basis functions. Therefore, $f_s$ lives in a subspace defined by the span of the basis vectors $\phi_i$. We prove in the Theorem that we can always find another function $f_t$ in the same subspace for which these distance-based measures between the two induced joint distributions are arbitrarily large (i.e., larger than any prescribed bound). We provide a constructive proof in Appendix B.1. Now consider defining a transfer learning scenario using the distributions induced by $f_s$ and $f_t$ as source and target distributions respectively. If training is successful on the source task, the network will build a representation of the function $f_s$. Since $f_s$ and $f_t$ can be built from different linear combinations of the same basis vectors (or features, in the language of our paper), transfer learning will work very well between the two tasks. For example, one could simply train the output weights on the new task defined by $f_t$ to learn the correct new linear combination of features. This is a scenario in which transfer is efficient, but the two distributions are very far apart as measured by these metrics. Conceptually, the purpose of this Theorem is to demonstrate that these metrics cannot reliably predict transfer performance, since one can always construct such counterexamples.
Your understanding of the Theorem appears to be the opposite of what we proved. In fact, it is not possible to build two tasks with identical distributions for which transfer performance is poor. The reason is the following: we expect transfer efficiency to be poor when the important features associated with the two tasks do not lie in the same subspace, since the network must then learn additional features in order to express the target function. If two tasks have the same distribution, they must be defined by the same function and trivially lie in the same subspace.
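One concrete one-parameter family realizing this construction (a worked example under the noise model of Eq. 1; Appendix B.1 gives the general proof):

```latex
% Take f_t = c\, f_s for a scalar c, so f_t trivially lies in the subspace
% spanned by \{\phi_i\}.  With label noise \epsilon \sim \mathcal{N}(0,\sigma^2),
% the two joint densities share the x-marginal, so by the chain rule for the KL divergence
\mathrm{KL}\!\big(p_s(x,y)\,\|\,p_t(x,y)\big)
  = \mathbb{E}_x\!\left[\mathrm{KL}\!\big(\mathcal{N}(f_s(x),\sigma^2)\,\big\|\,\mathcal{N}(c\,f_s(x),\sigma^2)\big)\right]
  = \frac{(1-c)^2}{2\sigma^2}\,\mathbb{E}_x\!\left[f_s(x)^2\right]
  \xrightarrow[\;c\to\infty\;]{}\ \infty ,
% while a network whose hidden features span \{\phi_i\} can represent f_t exactly
% by rescaling its readout weights alone.
```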
Thank you for pointing out that this picture was not clear in the current manuscript. We hope that with this explanation, you will consider re-evaluating the paper. We will reword this section in our revised draft to address any confusion. In addition, we plan to include an “Our Contributions” section after the introduction in which we clearly map out our argument.
Thank you for your response. I added a few additional comments in the original review that I will repeat here.
Thank you for your discussion around Theorem 2.2. While reading your rebuttal plus re-reading the paper, I have a few additional concerns that should be discussed.
You initially say that x and y are jointly distributed and that the labels are real numbers y in R (second sentence in Section 2). This is fine although a slight abuse of notation since random variables are not numbers. Then you define noisy versions of y and call them ys and yt. They are essentially noisy transformations of x. So far that is fine. Now in Assumption 2.1, you say that x is a data point and that y is a function on x that derives its statistical properties from f(x) + e. This is incorrect, because, if y is a function and not a random variable, then it cannot have statistical properties. Essentially, you are simultaneously saying that y=f(x)+e AND that y is a function of x (ie y(x) =f(x)+e), which doesn't make sense. Additionally, if x is a point, then y is just Gaussian since e is Gaussian. So where did the randomness in x go? Also are you only considering Gaussian data? Actually, it seems to be univariate Gaussian, and so the most basic case possible.
Your explanation is helpful but not complete. For example, let's say that we train the source model and get f which is essentially a wavelet transform. Then the basis functions phi's are just some wavelet basis in L^2. Then the subspace is the entire space. Then you are saying that you can find another function using the wavelet basis that gives poor performance. Obviously that is true -- just pick random wavelet coefficients. And if you want larger error you can just scale the coefficient weights.
Point 2 seems especially true, since you do not say that f and g are similar. You just say that you can find a g in the same subspace that f is in that is different.
If x has a fixed distribution for source and target tasks but fs and ft are different then the joint distributions p(x,ys) and p(x,yt) will be different. So I am not sure how you can claim in your rebuttal that "it is impossible to build two tasks with identical distribution for which transferring is poor". It must be that you are defining these functions, somehow, by learning them. But it is not clear how. Maybe you are assuming that the networks converge to a global minima?
Dear Reviewer Ct4H,
Thank you for your additional comments; we will do our best to address any confusion here. To your first point, it is true that random variables are formally functions that map from a sample space to some measurable space. In our paper, we use the measurable spaces in which the inputs $x$ and labels $y$ take their values, and we simply use the same symbols for the random variables and their realizations. Although this is formally an abuse of notation, it is standard in the literature (see Learning Theory From First Principles by Francis Bach, for example). You should think of the data points $(x, y)$ as jointly distributed. The function $f_s$ maps the input space to $\mathbb{R}$, and the label values are noised outputs of this function. Since the noise is normally distributed, the conditional distribution of $y$ given $x$ is Gaussian, but we make no assumptions on the marginal probability measure of $x$, aside from the constraints that it admits a density and that $f_s$ is square integrable with respect to this measure, which is manifest in the fact that $f_s$ is itself a member of the Hilbert space $L^2$. Therefore the joint density is not Gaussian in general. This setup is also a standard model for regression in the literature.
The idea we are proposing is that if the network creates a basis for the shared source-target subspace during pretraining, the model should be able to transfer well to any other task whose labels are derived from a function in that subspace, since the network can already express any function in that space by adjusting its output weights. Theorem 2.2 demonstrates, however, that measuring task similarity using the Dudley metric or the KL divergence can mislead you, since you can find an arbitrarily distant task (with respect to these metrics) in the subspace. Although there is no assumption on how the functions are learned in the proof of our Theorem, our interpretation of the result in the context of transfer learning imagines that pretraining learns the source function well. In our response to your original comment, for example, we explain how the theorem pertains to transfer learning when the pretrained network has learned the Bayes predictor for the pretraining task. In that case, it is necessary that the network's feature space at least spans the subspace that the source function is drawn from. In deep linear models, we prove that gradient flow converges to a global minimum of the population loss, which is the Bayes predictor. While our work does not focus on wavelets, your example is no different: if the network learned a basis of wavelet functions up to some cut-off length scale, it would be able to transfer well to any function built from those wavelets, yet the distance metrics could mislead you.
To your final point, if the distributions $p(x, y_s)$ and $p(x, y_t)$ are the same, it must be the case that $f_s = f_t$, in which case transfer will be effective since the source and target tasks are identical.
Dear Reviewers,
Thank you all for your useful feedback on our manuscript. We have found all of your points to be insightful and productive in making our paper the best it can be. For your convenience we have summarized the changes to our work below:
- Added an Our Contributions section to the introduction (lines 95-107) that summarizes our main argument
- Included an experiment comparing transfer performance to an optimally regularized scratch trained model (Appendix E Figure 5), which clarifies the “anomalous positive transfer effect”. We have also included some additional discussion of this in the main text (lines 275-277, 356-358)
- Added discussion about the limitations of deep linear networks (lines 215-221)
- Included a discussion of actionable implications of our theory for creating better pretrained models (lines 531-540 and Appendix E Fig 9).
We sincerely thank you for your time in reviewing our work and hope that you find these revisions helpful for the clarity and strength of the paper.
This paper studies, from a new feature-centric viewpoint, the similarity of source and target tasks in transfer learning. The implications of such an analysis should be important because they can help understand the intricacies of knowledge transfer from a foundation model to a new situation, e.g. through fine-tuning.
It is both interesting and novel how this paper advocates for a “feature view” (over “data view”) for transfer learning. Moreover, the area of understanding transferability and related concepts is significant. The paper is generally well written, however Theorem 2.2 has caused some confusion.
While the theoretical contributions are interesting, the arguments put forward in the paper are not entirely supported by the research conducted. Firstly, the reviewers consider the theoretical analysis carried out on linear models to be limited. They also mention prior theoretical work on transfer learning which actually considers both the linear and non-linear case. Secondly, reviewers point out certain ambiguities in the discussion about data-centric vs feature-centric metrics, and overall remain somewhat confused about the connection between the “storyline” of the paper and the results/claims presented.
Overall, this paper has benefited from five expert reviewers who also engaged in discussion; however, none of the reviewers felt that this paper in its current form clearly passes the acceptance threshold for ICLR. They did point out several directions for improvement, in particular regarding "tying" the motivation to the storyline and results/claims of the paper, as well as, ideally, extending the analysis to non-linear cases.
Additional Comments from Reviewer Discussion
- The reviewers engaged deeply in discussion with the authors to try and clarify certain ambiguities in the problem set-up, but they remained unconvinced after the rebuttal. The discussion topics focused on the weaknesses pointed out in the meta-review section.
- There was also private discussion among reviewers and AC, where there seemed to be a general consensus that despite this being an interesting idea, in its current form this paper is not ready for yet for publication.
Reject