PaperHub
6.6 / 10
Spotlight · 4 位审稿人
最低 3 · 最高 4 · 标准差 0.5
4
3
4
3
ICML 2025

Provable Benefits of Unsupervised Pre-training and Transfer Learning via Single-Index Models

OpenReview · PDF
提交: 2025-01-22 · 更新: 2025-07-24
TL;DR

We prove that unsupervised pre-training can dramatically reduce sample complexity in single-index models

摘要

关键词
Single-Index Model · Pre-training · Transfer-learning · Sample complexity · Stochastic Gradient Descent

评审与讨论

审稿意见
4

This paper investigates the benefits of unsupervised pre-training in supervised learning tasks, focusing particularly on GLMs with parameter vectors correlated to a spike in the data covariance. The authors show that a "PCA initialization" derived from an independent sample of unlabeled data can drastically reduce the sample complexity threshold for parameter vector recovery through SGD, compared to random initialization. They highlight a setting in which random initialization fails regardless of the amount of data, while PCA initialization does not. Additionally, the authors discuss the benefits of pre-training in transfer learning, which can lead to a polynomial reduction in sample complexity, provided that the correlation between the source and target is strong enough. The findings are validated through finite-dimensional simulations.
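For readers who want a concrete picture of this setup, here is a minimal toy simulation in the spirit of the model described above (the link function, spike strength, step size, and sample sizes are illustrative choices of ours, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam, rho = 200, 2.0, 0.8        # dimension, spike strength, spike/target correlation (illustrative)

# Latent directions: theta is the covariance spike, v is the regression direction, <theta, v> = rho.
theta = np.zeros(d)
theta[0] = 1.0
v = rho * theta + np.sqrt(1.0 - rho**2) * np.eye(d)[1]

def f(t):
    # Degree-3 Hermite link He_3(t) = t^3 - 3t (information exponent 3).
    return t**3 - 3.0 * t

def df(t):
    return 3.0 * t**2 - 3.0

def sample(n):
    # Features x ~ N(0, I + lam * theta theta^T), labels y = f(<v, x>).
    z = rng.standard_normal((n, d))
    x = z + (np.sqrt(1.0 + lam) - 1.0) * np.outer(z @ theta, theta)
    return x, f(x @ v)

# "PCA initialization": top eigenvector of the empirical covariance of an unlabeled sample.
Xu, _ = sample(5 * d)
w_pca = np.linalg.eigh(Xu.T @ Xu / len(Xu))[1][:, -1]
w_pca = np.sign(w_pca @ v) * w_pca   # resolve the eigenvector's sign ambiguity (in practice try both signs)

def online_sgd(w, n_steps, eta):
    # One-pass SGD on the squared loss: one fresh sample per step, renormalizing w after each update (spherical SGD).
    for _ in range(n_steps):
        x, y = sample(1)
        t = x[0] @ w
        w = w - eta * 2.0 * (f(t) - y[0]) * df(t) * x[0]
        w = w / np.linalg.norm(w)
    return w

w_rand = rng.standard_normal(d)
w_rand /= np.linalg.norm(w_rand)

n_steps, eta = 50 * d, 1e-4
print("overlap |<w, v>| after SGD, random init:", abs(online_sgd(w_rand, n_steps, eta) @ v))
print("overlap |<w, v>| after SGD, PCA init:   ", abs(online_sgd(w_pca, n_steps, eta) @ v))
```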

给作者的问题

Following the discussion in the strengths and weaknesses section, I would ask the authors to provide further comments on their choice of setting and whether they can offer a stronger justification for it. I suggest expanding this discussion in the paper as well.

论据与证据

All claims are supported by convincing evidence.

方法与评估标准

The methods are well suited to this problem.

理论论述

I checked the proofs of the theorems in Section 3; I have no issues to discuss.

实验设计与分析

I have no issues to discuss.

补充材料

I reviewed the supplementary material; I have no issues to discuss.

与现有文献的关系

This paper builds on the existing literature on supervised learning with GLM target functions, extending methods and results from the Gaussian i.i.d. setting with random initialization to the spiked covariance problem. The authors introduce a pre-training-based initialization procedure, which improves learnability in the considered setting, offering a significant advance over traditional random initialization approaches in high dimensions.

遗漏的重要参考文献

I am not aware of any essential references that have been omitted.

其他优缺点

The paper is clear in its exposition and addresses the important issue of the theoretical understanding of the benefits of pre-training, deriving interesting results that support the argument. One weakness is the artificial nature of the data distribution, which may seem tailored to make the proposed pre-training method work. In my opinion, the authors do not discuss this artificiality sufficiently, nor do they highlight how this toy model could resemble more realistic settings and how the results may translate to them. Nonetheless, this setting provides clear theoretical insights and original results, and could serve as a baseline for future work in more complex or realistic contexts.

其他意见或建议

A minor issue is the inconsistency in the use of parentheses around citations (for example, on lines 327-328, second column). I suggest revising the paper to correct this.

作者回复

Thank you for the thoughtful comments and suggestions.

  • Use of parentheses around citations---Thank you for pointing out this issue. We will be sure to correct this in the final version.

  • Limitations of the data distribution---We are happy to include additional comments on the motivations for our data distribution, and its limitations. Let us recap here: Our goal is to understand the benefit of pre-training. For this, one must have a solid baseline where the sample complexity without pretraining is exactly understood. We work with single index models/GLMs as they are, to our knowledge, the only non-trivial models of supervised learning with neural networks where the sample complexity of online SGD is exactly understood.

    For pre-training to be successful, the (unlabeled) features must carry some non-trivial information about the downstream prediction task. For single-index models, this correlation must enter either through a non-trivial mean or through the covariance of the features. The question is which more naturally models what is seen in practice. In the context of modern language models, a popular pre-training strategy is to construct artificial prediction tasks by 'masking' tokens in the unlabeled text data. Intuitively, one hopes to construct cloze-style masks, i.e., masks that are helpful for predicting the downstream label. This masking strategy has been deployed in well-known models such as BERT (Devlin et al. (2018)). More generally, it is common to perform pre-training via denoising autoencoders. In light of this, and the references in Section 3.1 regarding autoencoders, it seems more natural to relate the features to the downstream prediction task via the feature covariance. Given this, the spiked covariance model is a natural, albeit simple, starting point. Let us end by emphasizing that we are not proposing PCA as a general-purpose pretraining algorithm. Instead, we use it as a natural model of pretraining in our setup. We hope that this simple analysis will motivate further investigations into richer models.
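For concreteness, one standard way to write down the kind of spiked-covariance single-index setup discussed above (the parameterization below is our shorthand, not a quote of the paper's definitions): the unlabeled features carry the spike direction $\theta$, while labels depend on a direction $v$ correlated with $\theta$,

$$
x = \sqrt{\lambda}\, g\,\theta + z, \quad g \sim \mathcal{N}(0,1),\ z \sim \mathcal{N}(0, I_d)\ \text{independent}, \qquad \operatorname{Cov}(x) = I_d + \lambda\, \theta\theta^{\top},
$$

$$
y = f(\langle v, x\rangle), \qquad \langle \theta, v\rangle = \rho \in (0, 1].
$$

Pre-training then amounts to estimating $\theta$ by the top eigenvector of the empirical covariance $\frac{1}{N}\sum_{i=1}^{N} x_i x_i^{\top}$ of the unlabeled sample, and this estimate is used to initialize online SGD for the supervised task.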

审稿人评论

Thanks for your reply and clarifications, in light of which I am updating my score to "accept".

审稿意见
3

The paper provides a theoretical analysis of how model initialization learned via unsupervised pre-training benefits supervised learning tasks when training a single-layer neural network using online stochastic gradient descent.

给作者的问题

  1. Could the authors elaborate on Assumption (1) on $f$ (Line 168)? I appreciate the discussion in Section 3.4. However, does the commonly used ReLU function not satisfy it? Could the third requirement be weakened from '$>0$' to '$\geq 0$'?
  2. Since in your setting the number of data samples and the number of gradient update steps coincide, this leads to a lower bound on the sample complexity in Theorem 3.4. However, in reality these two quantities are not the same. If they are treated as two separate quantities, what can you obtain?

论据与证据

Yes.

方法与评估标准

Yes.

理论论述

I did not check the proofs.

实验设计与分析

Yes.

补充材料

No.

与现有文献的关系

The setting is interesting, but it requires some strong assumptions.

遗漏的重要参考文献

No.

其他优缺点

Strength: The paper provides a clear and well-motivated setup, and is strong in terms of readability and structure. The problem formulation and theoretical contributions are presented clearly and logically; in particular, the discussion sections, as well as the intuitive explanations of theorems and assumptions, are quite helpful.

Weakness: Assumption 3.2 significantly simplifies the analysis and may be overly restrictive. It requires that the gradient with respect to all data samples push the update towards the direction $e_1$ and away from $e_2$. However, in realistic scenarios, gradient descent updates could follow more complex trajectories while still converging to the optimum. This assumption might limit the general applicability of the theoretical insights.

其他意见或建议

See questions.

作者回复

Thank you for the thoughtful comments and suggestions.

  • Weakness of Assumption 3.2---Please note that Assumption 3.2 pertains to gradient flow on the population loss (population gradient flow) and not to online SGD. Indeed, one of our main contributions is to establish rigorously that if the step sizes are reasonably small, the evolution of the correlations of the SGD trajectory with the latent directions is well approximated by that of population gradient flow.

    We believe that Assumption 3.2 is quite mild. Note that $(1,0)$ is the global minimizer of the population loss. Assumption 3.2 is essentially equivalent to the assumption that the global optimizer is a local attractor for population gradient flow. This ensures that if population gradient flow is initialized in a small neighborhood of $(1,0)$, it converges to the optimum in a short time. We agree that the precise form of Assumption 3.2 could potentially be weakened---we stated it in a form that was easy to verify. In our paper, we verify this condition formally when $f$ is a Hermite polynomial.

  • Assumption 1---There has been significant recent progress in understanding the behavior of gradient descent algorithms on single-index models with isotropic Gaussian features. Prior work establishes that the sample complexity of one-pass SGD is governed by the "information exponent" of the problem. In this setting, the information exponent is the index of the first non-zero term in the expansion of the function $f$ in the Hermite basis. Assumption 1 is in the same spirit---we assume that $f$ has information exponent at least three, while $f^2$ has information exponent at most two (see the worked example after this list). We note in the paper that this assumption is satisfied by all Hermite polynomials of degree $\geq 3$, and by suitable linear combinations thereof---thus the assumption holds for a broad class of non-linearities $f$. We note that the ReLU function does not satisfy this condition; however, the ReLU activation has information exponent one, and thus can be learned with approximately linear sample complexity. The condition targets the more challenging non-linearities that have polynomial sample complexity in the isotropic single-index model.

  • Data sample vs. gradient updates---We focus exclusively on single pass SGD in this work. We agree that it would be interesting to study other variants e.g. batched SGD or multi-pass SGD in future work. We believe that our analysis can be extended to settings where each epoch uses fresh batches of data, but the batch sizes are relatively small. This variant will not change the sample complexity of the algorithm.
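As a worked example of the information-exponent condition discussed above (this uses the standard definition from the single-index literature; the computation is ours): for $Z \sim \mathcal{N}(0,1)$ and Hermite coefficients $c_j(g) = \mathbb{E}[g(Z)\,\mathrm{He}_j(Z)]$, the information exponent is $k(g) = \min\{j \geq 1 : c_j(g) \neq 0\}$. Take $f = \mathrm{He}_3$, i.e. $f(z) = z^3 - 3z$. Then

$$
c_1(f) = \mathbb{E}[Z^4 - 3Z^2] = 0, \qquad c_2(f) = 0 \ \text{(odd function)}, \qquad c_3(f) = \mathbb{E}[\mathrm{He}_3(Z)^2] = 3! = 6 \neq 0,
$$

so $k(f) = 3$, while for $f^2(z) = z^6 - 6z^4 + 9z^2$, the first-order coefficient vanishes by parity and

$$
c_2(f^2) = \mathbb{E}\big[(Z^6 - 6Z^4 + 9Z^2)(Z^2 - 1)\big] = \mathbb{E}[Z^8] - 7\,\mathbb{E}[Z^6] + 15\,\mathbb{E}[Z^4] - 9\,\mathbb{E}[Z^2] = 105 - 105 + 45 - 9 = 36 \neq 0,
$$

so $k(f^2) = 2$: $f$ has information exponent at least three and $f^2$ has information exponent at most two, matching the shape of the condition in Equation (1).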

审稿人评论

Thank you for the clarification! I increase the score accordingly.

审稿意见
4

This paper studies the dynamics of online SGD for learning single-index models on spiked Gaussian data, where the spike in the Gaussian distribution is meaningfully correlated with the ground-truth direction. The authors model pretraining and fine-tuning in practical machine learning settings as learning the spike direction and learning the ground-truth direction, respectively. In that sense, the authors prove that under a certain assumption on the target function and sufficient correlation between the spike and the ground-truth direction, SGD initialized from PCA pretraining successfully recovers the ground-truth direction with only a linear number of samples. In contrast, the authors prove negative results showing that $\Omega(d^2)$ samples are necessary from random initialization, and further that when the spike and ground-truth directions are identical, even constant initial correlation is not sufficient, regardless of the amount of data used.

给作者的问题

  1. Theorem 3.3 does not mention the assumption on $\lambda$. I believe it is assumed that $\lambda > 0$ is dimension-independent, in which case it would be helpful to mention this explicitly. Also, what happens if $\lambda$ and $\eta_1$ scale with dimension? Mousavi-Hosseini et al., 2023 consider $\lambda \asymp d^{r_2}$ and $\eta_1 \asymp d^{-r_1}$, and show that the sample complexity of full-batch gradient flow with PCA initialization is $d^{1 + \max((k-1)(2r_1 - r_2),\, 0)}$ for information exponent $k$, while Mousavi-Hosseini et al., 2025 consider gradient descent with exponential compute and show the sample complexity is $d^{1 - \max(r_2 - 2r_1,\, 0)}$, independent of the information exponent. It would be helpful if the sample complexity from Theorem 3.3 could be compared with these results.

  2. Moreover, it can be helpful to note that the sample complexity improvement of Theorem 4.2 is identical to that of Theorem 7 of Mousavi-Hosseini et al., 2023. The authors there turn gradient flow on spiked covariance into gradient flow on isotropic data with non-random initialization, and achieve a sample complexity of $d^{1 + 2\zeta(k-1)}$. Theorem 4.2 of this paper provides an improved $d^{2\zeta}$ factor. This is similar to the improvement of online SGD (Ben Arous et al., 2021) over gradient descent/flow (Bietti et al., 2022) on isotropic Gaussian data.

  3. I believe Equation (1) asks for $f$ to have information exponent $\geq 3$ and $f^2$ to have information exponent $\leq 2$. It would be helpful to discuss the relationship between the information exponent and Equation (1).

  4. My interpretation was that $r$ in Theorem 3.5 is not dimension-dependent. If that is the case, it would be helpful to explicitly mention it in the theorem statement.

  5. I think $B_2(0,1)$ is not defined in the text; does it mean the 2-dimensional ball of radius 1 centered at 0? Also, is there any way to make Assumption 3.2 more interpretable, e.g., for single Hermite polynomials? In that case, can we explicitly know the dimension-dependence of $m^*$?

References:

A. Mousavi-Hosseini et al. "Gradient-Based Feature Learning under Structured Data." NeurIPS 2023.

A. Mousavi-Hosseini et al. "Learning Multi-Index Models with Neural Networks via Mean-Field Langevin Dynamics." ICLR 2025.

G. Ben Arous et al. "Online stochastic gradient descent on non-convex losses from high-dimensional inference." JMLR 2021.

A. Bietti et al. "Learning Single-Index Models with Shallow Neural Networks." NeurIPS 2022.

论据与证据

The claims are clearly supported by proofs and some experiments.

方法与评估标准

Not applicable.

理论论述

I did not explicitly check the correctness of the proofs, but the statements seem plausible and in agreement with prior results.

实验设计与分析

Not applicable as this is mostly a theory paper.

补充材料

I did not review the details in the supplementary materials.

与现有文献的关系

The paper provides a valuable contribution to the high-dimensional SGD literature, especially since learning single-index models has seen a lot of recent attention as a model of recovering low-dimensional structure in high-dimensional spaces and feature learning with neural networks. While the results may not be very surprising, they offer a nice perspective on the principles of pretraining and fine-tuning, even for simple models such as learning single-index models.

遗漏的重要参考文献

Essential references seem to be sufficiently discussed. I provide additional discussion suggestions below.

其他优缺点

Strengths:

The paper presents a nice story with intuitive arguments that are easy to follow, and the negative result of Theorem 3.4 is particularly strong as it covers many initializations and any number of samples.

Weaknesses:

  1. The role of certain parameters in the results is not clear. For example, how would $\lambda$ or $\eta_1$ scaling with dimension affect Theorem 3.3? Also, Assumption 3.2 is a bit opaque. The authors provide some arguments to support this assumption in Section 5, but I still don't have a good intuition for how $m^*$ behaves with respect to dimension in typical settings. I would be happy to raise my score if there is more discussion and examples on these points.

  2. A limitation of the lower bounds is that they only apply to sufficiently small step size, $\alpha\delta^2 = O(1)$ and $\alpha\delta^2 = O(d^{1/3})$. Intuitively, SGD updates should become unstable beyond this step size, but this is not theoretically shown in this work or prior works such as Ben Arous et al., 2021.

其他意见或建议

Please see below.

作者回复

Thank you for the thoughtful comments and suggestions.

  • Notation and dimension dependence of parameters---We assume throughout that $\lambda > 0$ is dimension-independent. $B_2(0,1)$ is indeed the two-dimensional ball of radius 1 centered at the origin, and $m^*$ is also dimension-independent in our analysis. We will clarify these dependencies in the final version.

  • $\lambda$ and $\eta_1$ scaling---We assume that $\lambda$ and $\eta_1$ are dimension-independent in our work. We expect that the analysis we provide here can be carried over to the case (following your notation) $r_1 = 0$ and $r_2 > 0$. We agree that the case $r_1 > 0$ is interesting. However, it will require substantially more work: one would have to carry out a similar "bounding flows"-style argument, but around the point $(0,1)$ instead of $(0,0)$. In light of the tight turnaround for this conference, we defer this to future work.

  • Comparison with Mousavi-Hosseini et al. (2023, 2025)---Thank you so much for pointing us to these results. As remarked above, if $\lambda$ and $\eta_1$ scale with $d$, our regime is related to the prior work. Thank you also for pointing us to the sample complexity improvement in Mousavi-Hosseini et al. (2023). We will add a discussion of these connections in our final version.

  • Information exponent clarification---Thanks for pointing this out. We will include your interpretation in the final version. Please see the response to reviewer DGqD.

  • Assumption 3.2, dimension dependence of $m^*$, and dimension dependence of $r$ in Theorem 3.5---Thanks for these comments. Indeed, the parameter $r$ in Theorem 3.5 is dimension-independent. Assumption 3.2 is equivalent to the condition that the global minimizer of the population risk (i.e., $(1,0)$) is locally attractive; see the generic illustration after this list. Thus, the population dynamics initialized in a neighborhood of the global minimum will converge to this minimum. Finally, note that $m^*$ is dimension-independent---to see this, note that $m^*$ can be specified in terms of the function $\phi$, which is dimension-independent.

  • Step size for SGD---If the step size is larger, SGD does not follow the population gradient flow. This invalidates the main analysis strategy introduced in Ben Arous et al. We adapt these ideas in our analysis and are thus also unable to handle SGD with a large step size.
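To make the "local attractor" reading of Assumption 3.2 above concrete, here is the generic linearization fact it appeals to (a textbook statement about gradient flows, not a condition copied from the paper): if $\Phi$ denotes the population loss as a function of the two correlations $m = (m_1, m_2)$ and the population dynamics are the gradient flow

$$
\dot m(t) = -\nabla \Phi(m(t)),
$$

then positive definiteness of the Hessian at the minimizer, $\nabla^2 \Phi(1,0) \succ 0$, guarantees the existence of a radius $r_0 > 0$ such that any trajectory started in $B_2((1,0), r_0)$ converges exponentially fast to $(1,0)$. Verifying a condition of this type for a specific link $f$ (e.g., a single Hermite polynomial) reduces to a two-variable computation that does not involve the ambient dimension, which is consistent with the response's point that $m^*$ and $r$ are dimension-independent.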

审稿意见
3

This paper investigates the benefits of unsupervised pre-training and transfer learning in the context of high-dimensional supervised learning, specifically in single-layer neural networks trained via online stochastic gradient descent. The authors establish that pre-training and transfer learning can significantly reduce sample complexity and, in some cases, even lead to exponential improvements over random initialization. The analysis relies on single-index models with spiked covariance and characterizes the effects of different initialization schemes. A key result is that pre-training enables efficient learning in scenarios where random initialization fails due to the existence of attractor basins. The paper also extends its analysis to transfer learning and shows how pre-trained feature representations improve learning efficiency.

给作者的问题

Theorem 3.5 suggests that below a certain correlation threshold, learning does not occur. Could this result be an artifact of the chosen expansion, rather than a fundamental limitation? Have you considered alternative pre-training methods beyond PCA, and how would your results extend to those cases? The numerical experiments show smooth transitions rather than sharp phase transitions. How do you justify using the term “phase transition” in this context?

论据与证据

The paper makes several claims regarding the benefits of pre-training and transfer learning. While the analysis generally follows well-established approaches in the study of high-dimensional inference, some claims appear somewhat engineered. For instance, the role of the Information Exponent is controlled to illustrate benefits, which raises concerns about whether the observed advantages of pre-training arise naturally or are artificially emphasized. The negative result in Theorem 3.5—establishing a threshold below which learning does not occur—is an interesting contribution, but it might be an artifact of the specific Taylor expansion used rather than a fundamental property of the model. The claim that pre-training yields exponential benefits in certain cases is intriguing, but more discussion is needed on how realistic those cases are in practical settings.

方法与评估标准

The methodology follows a standard approach for analyzing the sample complexity of single-index models. The authors leverage stochastic gradient descent in a spiked covariance model and analyze the evolution of correlations over training. However, the connection to prior work, particularly the analysis of Ben Arous et al., is very strong, making the contribution appear incremental. The paper could benefit from a clearer comparison to these prior works to highlight what is truly novel in the methodology.

Additionally, the use of PCA as a pre-training strategy is reasonable but somewhat simplistic, as it does not capture the more sophisticated pre-training methods commonly used in deep learning. However, this might be expected for analysis purposes, and I agree with the authors' justification (who cite the statistical properties of PCA).

理论论述

The theoretical results are certainly correct, but they rely on assumptions that may limit their applicability. The proof techniques align with prior work on high-dimensional learning dynamics and information exponents. The claim that pre-training reduces sample complexity is well supported, though it is not surprising given the prior literature. Theorem 3.5 presents a negative result that suggests the existence of a minimum correlation threshold for learning, which is an interesting insight. However, the ability to set the leading term in the Taylor expansion to a negative value raises concerns about whether this result is intrinsic to the problem or simply an artifact of the chosen expansion.

实验设计与分析

The empirical results support the theoretical claims but do not provide strong additional insights. The numerical simulations primarily verify the sample complexity predictions rather than exploring the broader practical implications of pre-training. One limitation is that the experiments focus on synthetic settings with controlled spiked covariance structures, which may not fully capture real-world scenarios. Moreover, the figures suggest that the transition between learning and non-learning regimes is relatively smooth rather than sharply defined, raising questions about whether the term “phase transition” is appropriate in this context.

补充材料

The supplementary material includes additional proofs and numerical results. It provides the necessary technical justifications for the main claims.

与现有文献的关系

The paper is well-situated within the existing literature on pre-training and transfer learning in high-dimensional learning models.

On a side note, the spiked random feature model has been investigated, with different motivations, in prior works such as:

  • Ba, J., Erdogdu, M. A., Suzuki, T., Wang, Z., Wu, D., & Yang, G. (2022). High-dimensional asymptotics of feature learning: How one gradient step improves the representation. Advances in Neural Information Processing Systems, 35, 37932-37946.
  • Cui, H., Pesce, L., Dandi, Y., Krzakala, F., Lu, Y. M., Zdeborová, L., & Loureiro, B. (2024). Asymptotics of feature learning in two-layer networks after one gradient-step. arXiv preprint arXiv:2402.04980.
  • Dandi, Y., Pesce, L., Cui, H., Krzakala, F., Lu, Y. M., & Loureiro, B. (2024). A random matrix theory perspective on the spectrum of learned features and asymptotic generalization capabilities. arXiv preprint arXiv:2410.18938.

遗漏的重要参考文献

When the paper cites Lee et al. 2024, it should also cite the near-identical but prior results of Arnaboldi et al. (arXiv:2405.15459). In fact, I would be curious about the effect of the training discussed in Lee et al. and Arnaboldi et al. in the present paper.

其他优缺点

Strengths:

Provides a theoretical framework for understanding pre-training and transfer learning in high-dimensional learning problems. Derives sample complexity results that confirm the benefits of pre-training in structured feature models. Highlights the role of initial feature correlation in determining the success of training, which is an important insight.

Weaknesses:

The theoretical results, while correct, are largely incremental relative to prior work. The pre-training strategy (PCA) is somewhat simplistic compared to modern deep learning methods. The empirical validation is limited to synthetic experiments and does not explore broader practical implications. Some results, such as the failure of learning below a certain correlation threshold, may be artifacts of the mathematical formulation rather than fundamental properties of the problem.

其他意见或建议

The authors should more explicitly compare their results to prior work, particularly in relation to the spiked random feature model and information exponents.

The numerical experiments could be extended to investigate how different levels of correlation affect learning outcomes.

The discussion of Theorem 3.5 should clarify whether the threshold phenomenon is a fundamental property of the model or an artifact of the analysis.

作者回复

Thank you for the thoughtful comments and suggestions.

  • Controlling the information exponent---We indeed control the information exponent of $f$. However, this assumption is satisfied by all Hermite polynomials of degree at least three. We can also check this condition for many polynomials constructed as combinations of these monomials. Consequently, our assumption captures a reasonably generic class of functions and is not an edge case.

  • Taylor expansion in Theorem 3.5 & minimum correlation threshold---Please note that this is a Taylor expansion of the population loss and is necessarily fundamental to the model. The sign of the coefficient depends on properties of the data distribution, namely $f$ and $\lambda$, and cannot be changed at will. The desired sign of the coefficient holds for a broad class of functions (see above). There does exist an $r$ (depending on $f$) such that one-pass SGD fails to learn the latent signal for initial correlations less than this value. This correlation value $r$ is fundamental to the model, and is obtained by analyzing the population loss landscape around the origin ($m(X_0) = 0$) for functions $f$ satisfying the assumptions stated above. Note that we do not aim to exactly characterize the class of functions $f$ that lead to this phenomenon. Instead, our goal is to illustrate the benefits of pretraining on a class of commonly studied models.

  • 'Phase transitions' between not learning and learning---We do not use the terms 'phase transition' or 'sharp threshold' in our paper. However, as a consequence of Theorem 3.5, one can show a sharp phase transition in the sample complexity for a fixed $f$ as $\lambda$ varies: when $\lambda > 0$ (the case studied in the paper) the information exponent is infinite, so the sample complexity is super-polynomial. If, however, $\lambda \leq 0$, the sample complexity is polynomial.

  • Other pretraining methods---We study PCA as a natural pretraining algorithm for this data model for the reasons described in Section 3.1. Please see the references therein for various settings in which it is argued that more sophisticated pre-training methods effectively implement PCA. It would be interesting to extend the analysis to more realistic pre-training algorithms. That said, we note that our arguments would naturally extend to other pretraining algorithms if one is able to characterize their 'overlap' with the planted signal. This, however, is beyond the scope of the current paper.

  • Theoretical novelty---The main technical novelty compared to Ben Arous et al. arises from the analysis of the population gradient flow. In that work, the population gradient flow reduces to a one-dimensional system, whereas here we study a two-dimensional dynamical system, which exhibits subtle properties due to the presence of local traps. The analysis of this dynamical system is our main technical contribution.

  • Missing reference and comparison with prior work---Thank you for catching the missing Arnaboldi et al. reference. We will certainly add this reference in the camera-ready version. We will also add a comparison with, and further contextualize, the recent works on the spiked random feature model.

  • Multi-pass SGD---We agree that it would be interesting to study multi-pass SGD in this setup, adapting the ideas in Lee et al. and Arnaboldi et al.

  • Empirical validation beyond stated framework---Pretraining and transfer learning are routinely employed in modern practice. Our results are an attempt to explain the empirical success of these methods. In this sense, our results provide theoretical insight into empirical phenomena already observed in previous works.

  • Empirical study of effect of correlation at initialization---We are happy to add a simulation study noting the effects of initial correlation on the output of one-pass SGD.

审稿人评论

Thank you to the authors for their detailed and thoughtful responses. I appreciate the clarifications provided. While the rebuttal addresses some of my main concerns, I still find the contribution somewhat incremental relative to prior work and slightly engineered in parts. I therefore maintain my original score.

That said, I note a clear consensus toward acceptance among the reviewers, and I also recommend acceptance.

最终决定

This is a technically solid paper that studies the role of unsupervised pre-training and of transfer learning in the context of the single index model, for which many properties of the learning procedure and hardness can be understood in great detail. This paper obtains numerous interesting results in this direction and constitutes an interesting contribution to ICML.

I would note that similar models were considered in additional existing literature that is worth including in the related work: e.g., a model similar to the one used for pre-training in https://proceedings.mlr.press/v202/pesce23a/pesce23a.pdf, and another model related to the one used for transfer learning in https://iopscience.iop.org/article/10.1088/2632-2153/ac4f3f