The Generative Leap: Tight Sample Complexity for Efficiently Learning Gaussian Multi-Index Models
We define a "generative leap" exponent which tightly captures the sample complexity for efficiently learning Gaussian multi-index models.
Abstract
Reviews and Discussion
This paper introduces the generative leap exponent, extending the generative exponent from single-index to multi-index Gaussian models. The authors establish a computational lower bound of $n \gtrsim d^{k_\star/2}$ under the Low-Degree Polynomial (LDP) framework for agnostically learning the hidden subspace, where $k_\star$ is the generative leap exponent. They also propose a sequential spectral estimator which achieves a matching upper bound.
Strengths and Weaknesses
The paper is theoretically sound and well organized. Establishing the tight sample complexity for Gaussian multi-index models is significant, and the authors propose a novel algorithm that matches it. Some sections could be more accessible, especially the leap decomposition, which is quite dense, but in general the core claims are well supported.
Questions
- L115: The k-th Hermite tensor uses bold type here but not in the following text. The authors can also emphasize which symbol represents the Gaussian distribution.
- L150: What does $\mathrm{span}(\Gamma)$ mean when $\Gamma$ is a tensor?
- It might be good to explain the meaning of the generative leap exponent, as is done in L175-177 for the generative exponent, to better understand their difference. The authors can also explain the leap decomposition (L149) in more detail to convey the intuition behind, and the difficulty of, generalizing the leap exponent from single-index models to multi-index models.
- In Proposition 4, why do we only have an upper bound on $k_\star$ but not its exact value?
- The authors mention the number of samples that GD needs. Is it suboptimal?
Limitations
yes
Final Justification
This is a strong paper and I recommend acceptance.
Formatting Issues
no
We sincerely thank the reviewer for their thoughtful feedback and address each point below.
L115: The k-th Hermite tensor uses bold type here but not in the following text. The authors can also emphasize which symbol represents the Gaussian distribution.
Thank you for catching these – we will correct them for the camera ready.
L150: What does $\mathrm{span}(\Gamma)$ mean when $\Gamma$ is a tensor?
We apologize for the confusion and have added a section to the paper clarifying our tensor notation, including the definition of $\mathrm{span}(\Gamma)$. In general, we can define the span of a tensor at an index $l$ by
$$\mathrm{span}_l(\Gamma) = \left\{\Gamma[v_1,\ldots,v_{l-1},\cdot,v_{l+1},\ldots,v_{2k}] \,\middle|\, v_1,\ldots,v_{2k} \in S^\perp\right\}.$$
This is equivalent to the span of the flattened version of $\Gamma$ in which the $l$-th index is pulled out. Due to the symmetries of $\Gamma$, $\mathrm{span}_l(\Gamma)$ is independent of $l$, so we simply write $\mathrm{span}(\Gamma)$. In this case it is simplest to think of reshaping $\Gamma$ into a matrix whose rows are indexed by the pulled-out index and whose columns enumerate the remaining $2k-1$ indices, and then computing the column span of this matrix.
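For concreteness, here is a minimal numpy sketch of this reshaping view (purely illustrative, not code from the paper; we assume $\Gamma$ is given as a dense array and $S^\perp$ by a matrix `P` with orthonormal columns):

```python
import numpy as np

def span_of_tensor(Gamma, P, tol=1e-10):
    """Orthonormal basis for span(Gamma): pull out one index, project every
    remaining index onto the subspace spanned by the columns of P, flatten,
    and take the column space of the resulting matrix."""
    d = Gamma.shape[0]
    order = Gamma.ndim                       # = 2k in the notation above
    T = Gamma
    for _ in range(order - 1):
        # Contract the next remaining index with the basis P of S^perp.
        T = np.tensordot(T, P, axes=([1], [0]))
    M = T.reshape(d, -1)                     # pulled-out index times flattened rest
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    r = int(np.sum(s > tol * s[0])) if s.size else 0
    return U[:, :r]

# Sanity check on an order-2 tensor (a matrix): here the span is just S^perp itself.
Gamma = np.eye(6)
P = np.eye(6)[:, :3]
print(span_of_tensor(Gamma, P).shape)        # (6, 3)
```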
It might be good to explain the meaning of the generative leap exponent, as is done in L175-177 for the generative exponent, to better understand their difference. The authors can also explain the leap decomposition (L149) in more detail to convey the intuition behind, and the difficulty of, generalizing the leap exponent from single-index models to multi-index models.
We agree with the reviewer's recommendation and will include some explicit examples for the filtration corresponding to the generative leap and how it differs from the information leap. We believe that a good example to highlight is . The filtration for the generative leap is since we can apply a label transformation to lower the exponent in the direction to (for example works here), and then condition on to learn . This corresponds to two leaps of size . On the other hand, the filtration for the information leap is just directly since directly learning is leap 4 so the learner is forced to learn both coordinates from , which is a single leap of size .
In Proposition 4, why do we only have an upper bound on $k_\star$ but not its exact value?
Proposition 4 uniquely determines $k_\star$ for Gaussian parity and for staircase functions. For intersections of halfspaces and general polynomials, we showed $k_\star \le 2$, and the exact value depends on the specific halfspaces/polynomials. For example, in the single-index setting, $k_\star = 1$ for any polynomial link unless it is even, i.e. $f(-z) = f(z)$, in which case $k_\star = 2$. Thus an odd link such as $z \mapsto z^3$ has $k_\star = 1$, while an even link such as $z \mapsto z^2$ has $k_\star = 2$.
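As a quick numerical sanity check of this even/odd dichotomy (a standalone illustration, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(2_000_000)

# Odd link y = z^3: the first Hermite moment E[y * He_1(z)] = E[z^4] = 3 is already
# nonzero, so no label transformation is needed and the generative exponent is 1.
print(np.mean(z**3 * z))              # ~3

# Even link y = z^2: E[T(y) * z] = 0 for ANY transformation T by the z -> -z symmetry,
# but the second Hermite moment E[y * He_2(z)] = E[z^2 (z^2 - 1)] = 2 is nonzero,
# so the generative exponent is 2.
y = z**2
print(np.mean(np.sign(y - 1.0) * z))  # ~0 (holds for any T)
print(np.mean(y * (z**2 - 1.0)))      # ~2
```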
The authors mention the number of samples that GD needs. Is it suboptimal?
We believe the reviewer is referring to our comment on line 50. Our goal there was simply to set the stage that the sample complexities in this line of work are of the form $d^{s}$ for an exponent $s$ that depends on both the algorithm and the link function (e.g. $s = k-1$ for online SGD and $s = k/2$ for smoothed online SGD, where $k$ is the information exponent). We unfortunately re-used the same letter for this exponent and for the generative exponent in the manuscript, which may have led to confusion in line 50, so we will rename it on this line. We are still unsure what the optimal rate for gradient descent is, and whether it can achieve the optimal sample complexity with batch reuse for general multi-index models.
Thank you for your responses. I'm happy to keep my score.
This paper studies the complexity of learning Gaussian multi-index models, where the authors generalize the notion of the generative exponent for single-index models to a generative leap exponent. They prove that this exponent controls the low-degree-polynomial lower bound for the sample complexity, and give a polynomial-time algorithm with the same sample complexity as a function of $d$. They thus show that the generative leap exponent provides a tight characterization of the sample complexity for efficient learning of Gaussian multi-index models in high dimensions. The authors also establish links between the generative and information leap exponents, and show that their framework covers several examples of popular Gaussian multi-index models studied in the literature.
Strengths and Weaknesses
This is a technically strong paper with many novel and interesting results, and provides a complete picture of the complexity of learning Gaussian multi-index models, a problem that has received a lot of attention from the community as an example where learners can adapt to a latent low-dimensional structure in a high-dimensional ambient space and perform feature learning with non-linear predictors.
The paper is generally well-written and easy to read, but there are a few points that can improve the readability. In particular, I think that the readers could have a better intuition of the constructions in Section 2 if the authors present the example of staircase functions and how it leads to concrete instantiations of Definitions 3, 4, and Proposition 1. Also the authors could explain a bit more about certain notations, discussed below.
Questions
- What does $\mathrm{span}(\Gamma)$ mean for a tensor $\Gamma$?
- I believe the union in Equation (1) is meant to be a direct sum? In particular, the summands live in different subspaces; I'm not sure how one can take unions of them and end up with a new subspace.
- I'm a bit confused about Proposition 2. It seems that acts on , then shouldn't the push forward map be instead of ?
- From my understanding, not every subspace of the index space is seen when defining the generative leap. However, for the lower bound of Section 3, are we free to choose any subspace of the index space? Then would we be able to choose a subspace whose leap exponent is even larger than the one seen during Definition 3? I don't see why we would necessarily pick the one that corresponds to the maximizer of Definition 3.
- The authors mention in the paragraph starting from Line 25 that to achieve the information-theoretically optimal sample complexity of $n = \Theta(d)$, one needs to have a brute force estimator over an $\varepsilon$-net which will necessarily have an exponential computational complexity in high dimensions. However, if we assume the Gaussian multi-index model has a structured covariance, it is possible to obtain an improved computational complexity (even sub-exponential) while maintaining the same linear sample complexity (in some lower effective dimension). [1] presents an example of how gradient-based learning can achieve this computational adaptivity for multi-index models (which is absent from fixed-grid optimization over nets). It is generally interesting to know how the picture here would change once we have a structured covariance, which is known to improve the dependence on the information exponent for algorithms in the correlational statistical query class [2, 3].
- The discussion in Section 4.2 can of course be extended to non-polynomial time algorithms. Without any assumptions on , learning the regression function when depends on directions could have a minimax optimal rate that looks like .
- The authors may want to use in the 3rd line of Algorithm 1.
References:
[1] A. Mousavi-Hosseini et al. "Learning Multi-Index Models with Neural Networks via Mean-Field Langevin Dynamics." ICLR 2025.
[2] J. Ba et al. "Learning in the presence of low-dimensional structure: a spiked random matrix perspective." NeurIPS 2023.
[3] A. Mousavi-Hosseini et al. "Gradient-Based Feature Learning under Structured Data." NeurIPS 2023.
Limitations
yes
Final Justification
Per my original review this is a strong contribution and all reviewers are in agreement.
Formatting Issues
No concerns.
We sincerely thank the reviewer for their thoughtful feedback and address each point below.
The paper is generally well-written and easy to read, but there are a few points that can improve the readability. In particular, I think that the readers could have a better intuition of the constructions in Section 2 if the authors present the example of staircase functions and how it leads to concrete instantiations of Definitions 3, 4, and Proposition 1.
We agree with the reviewer's recommendation and will include some explicit examples for the filtration corresponding to the generative leap and how it differs from the information leap. We believe that a good example to highlight is . The filtration for the generative leap is since we can apply a label transformation to lower the exponent in the direction to (for example works here), and then condition on to learn . This corresponds to two leaps of size . On the other hand, the filtration for the information leap is just directly since directly learning is leap 4 so the learner is forced to learn both coordinates from , which is a single leap of size .
What does $\mathrm{span}(\Gamma)$ mean for a tensor $\Gamma$?
We apologize for the confusion and have added a section to the paper clarifying our tensor notation, including the definition of $\mathrm{span}(\Gamma)$. In general, we can define the span of a tensor at an index $l$ by
$$\mathrm{span}_l(\Gamma) = \left\{\Gamma[v_1,\ldots,v_{l-1},\cdot,v_{l+1},\ldots,v_{2k}] \,\middle|\, v_1,\ldots,v_{2k} \in S^\perp\right\}.$$
This is equivalent to the span of the flattened version of $\Gamma$ in which the $l$-th index is pulled out. Due to the symmetries of $\Gamma$, $\mathrm{span}_l(\Gamma)$ is independent of $l$, so we simply write $\mathrm{span}(\Gamma)$. In this case it is simplest to think of reshaping $\Gamma$ into a matrix whose rows are indexed by the pulled-out index and whose columns enumerate the remaining $2k-1$ indices, and then computing the column span of this matrix.
I believe the union in Equation (1) is meant to be a direct sum? In particular, the summands live in different subspaces; I'm not sure how one can take unions of them and end up with a new subspace.
Yes, you are correct that equation (1) should be a direct sum. We used $\cup$ to represent $\oplus$ at various points throughout the paper and will correct this for the camera ready.
I’m a bit confused about Proposition 2. It seems that acts on , then shouldn't the push forward map be instead of ?
The pushforward map should indeed read , rather than . Thank you for catching this typo, we will correct it in our next revision.
From my understanding, not every subspace of the index set is seen when defining the generative leap. However, for the lower bound of Section 3, are we free to choose any subspace of the index set? Then would we be able to choose a subspace whose leap exponent is even larger than the one seen during Definition 3? I don't see why we would necessarily pick the one that corresponds to the maximizer of Definition 3.
Theorem 1 is stated for picking the subspace that corresponds to the filtration just before the leap corresponding to the generative leap (specifically, a choice of that is seen in Definition 3). However we could easily generalize it to the following statement: for any subspace , distinguishing requires samples. As was chosen such that , this would imply Theorem 1.
You are indeed correct that we could apply this to any subspace, including those not seen during Definition 3. Arguing that samples are necessary for estimation therefore follows from this inductive argument:
- At the start you have no information, and you can prove that you need samples to detect the planted model, and therefore to estimate any additional directions.
- If the learner has more than samples, we will generously provide them with all of . Given , detecting the planted model, and therefore learning any additional directions, requires samples.
- ...
This induction shows that recovering any direction in requires samples and since , it demonstrates that recovering the index set requires samples. We will add the generalized version of Theorem 1 in terms of and the discussion connecting estimation with this series of decision problems to Section 3.
The authors mention in the paragraph starting from Line 25 that to achieve the information-theoretically optimal sample complexity of $n = \Theta(d)$, one needs to have a brute force estimator over an $\varepsilon$-net which will necessarily have an exponential computational complexity in high dimensions. However, if we assume the Gaussian multi-index model has a structured covariance, it is possible to obtain an improved computational complexity (even sub-exponential) while maintaining the same linear sample complexity (in some lower effective dimension). [1] presents an example of how gradient-based learning can achieve this computational adaptivity for multi-index models (which is absent from fixed-grid optimization over nets). It is generally interesting to know how the picture here would change once we have a structured covariance, which is known to improve the dependence on the information exponent for algorithms in the correlational statistical query class [2, 3].
We definitely agree with the reviewer that the picture gets much more interesting when there is additional structure in the input data. The generative leap directly extends to the setting of non-isotropic covariances so long as this covariance is fixed in advance. For example, for a single-index model with $x \sim \mathcal{N}(0, \Sigma)$ and the hidden direction chosen so that its whitened image is uniform on the sphere, we can just transform $x \mapsto \Sigma^{-1/2} x$: the input becomes isotropic and the direction becomes uniform on the sphere, so the generative exponent still applies. However, if the hidden direction and the covariance $\Sigma$ are correlated (as in [2,3]), then the learner can certainly leverage this information to learn more efficiently. We will add a section in the paper discussing possible extensions of our framework to more structured inputs.
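A small numpy sketch of the whitening reduction we have in mind (illustrative only; the covariance is assumed to be known in advance, and the link function below is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 10_000

# Fixed, known covariance and non-isotropic Gaussian inputs.
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d + np.eye(d)
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

# A single-index label y = f(<w, x>) with some link f.
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
y = np.tanh(X @ w) + 0.1 * rng.standard_normal(n)

# Whitening x -> L^{-1} x (with Sigma = L L^T) makes the inputs isotropic, and the
# model becomes a single-index model in the whitened direction L^T w (normalized),
# so the isotropic theory applies unchanged.
L = np.linalg.cholesky(Sigma)
X_white = np.linalg.solve(L, X.T).T
print(np.allclose(X_white @ (L.T @ w), X @ w))              # identical index values
print(np.round(np.cov(X_white, rowvar=False)[:3, :3], 2))   # approximately identity
```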
The discussion in Section 4.2 can of course be extended to non-polynomial time algorithms. Without any assumptions on , learning the regression function when depends on directions could have a minimax optimal rate that looks like .
We agree with this characterization of the minimax rate for estimation. However, it's not immediately clear that this rate is necessary for subspace recovery. One could potentially hope that subspace recovery is an easier task. The purpose of Section 4.2 is to give an example where it is actually necessary to find the right function of the already-recovered directions before you can apply the right label transformation to identify the remaining ones.
The authors may want to use in the 3rd line of Algorithm 1.
Could you clarify which line you are referring to? Are you referring to the line in Algorithm 2?
Thank you for the detailed responses. I think this is a strong contribution and I'm happy to keep my score.
Could you clarify which line you are referring to? Are you referring to the line in Algorithm 2?
I was referring to the fact that we are taking the top eigenvectors in Algorithm 1, but then I noticed this is already being pointed out in the output of the algorithm.
The paper considers multi-index models with Gaussian inputs in high dimensions and a finite number of indices. For this model, the authors show that one can estimate the hidden subspaces in the model with a sample complexity $n = \Theta(d^{k_\star/2})$, where $k_\star$, the generative leap, depends on the model. This is a necessary and sufficient condition.
Strengths and Weaknesses
The paper is detailed, clear and extremely well written. The proofs are clear, and the technique is interesting, although similar in essence to the one for generative exponents. The main contribution of this paper, the description of the generative leap, is fundamental and will surely have a big impact on the theory community. As the authors correctly state, most cases of practical interest have generative exponent less than or equal to 2 and were already studied in several previous works, so the paper's main limitation lies in the limited applicability of the results.
Questions
- In 4.2 you discussed a fine-grained analysis, in the sense of finding the minimal constant such that the hidden subspaces can be recovered. In particular, you state that the dependence on the number of indices can be exponentially bad. Can you provide a more comprehensive classification?
- In 5.3 you state that while shallow networks have a generative leap at least as large as the generative exponent, wider networks might not have this bound. Is there a practical example of this with some activations?
- Is there some way to salvage any of this theory in the case of a growing number of indices?
Limitations
Yes
Final Justification
I believe this to be an incredibly solid paper. My doubts were cleared by the rebuttal and I happily raised my score to full acceptance.
Formatting Issues
No concerns
We sincerely thank the reviewer for their thoughtful feedback and address each point below.
In 4.2 you discussed a fine-grained analysis, in the sense of finding the minimal constant such that the hidden subspaces can be recovered. In particular, you state that the dependence on the number of indices can be exponentially bad. Can you provide a more comprehensive classification?
The tight constant will depend significantly on whether or not the learner has knowledge of the link function. For example, when $k_\star = 2$, [Troiani et al. 2024] computed the tight constants for AMP, and we believe it is likely that a more careful low-degree polynomial lower bound would match these constants. Tight constants do not generally exist for $k_\star > 2$, since runtime and sample complexity can be continuously traded off. However, we can compute the exact weak-recovery threshold for our matrix U-statistic estimator when the optimal label transformation is used; for Gaussian parity, this constant is exponentially small in the number of indices.
More generally when the learner does not have knowledge of the link function, the constants (and dependencies) will generally be worse, and the constants will depend on the prior over link functions.
In 5.3 you state that while shallow networks have a generative leap at least as large as the generative exponent, wider networks might not have this bound. Is there a practical example of this with some activations?
The key difference between Proposition 6 and Proposition 7 is that the weights are restricted to be orthogonal in Proposition 6. In addition, we apologize for a typo in Proposition 7, which we have fixed in a revision – the corrected statement is that for any non-polynomial activation and any integer $k$, there exists a shallow neural network with generative leap exponent at least $k$.
Is there some way to salvage any of this theory in the case of a growing number of indices?
Our paper is focused on the setting in which the number of indices is constant and $d \to \infty$. When the number of indices is proportional to $d$, the function is in a sense "generic" since it can depend on almost all of the coordinates, and getting learning guarantees in this setting would require imposing additional assumptions on the link function. However, it may be possible to salvage the theory when the number of indices grows as $d^{\epsilon}$ for some small $\epsilon$. This would require developing more fine-grained estimates of the dependence on the number of indices in both our upper and lower bounds. For a constant number of indices, the tight constants (assuming knowledge of the link function) were computed in [Troiani et al. 2024]; however, we leave extending this to a larger number of indices to future work.
Thanks for your detailed answer. I am happy to raise my rating.
The authors investigated the computationally optimal sample complexities to learn the intrinsic features of multi-index models.
The authors first generalized the generative exponent (from single-index models) to the generative leap exponent via a leap decomposition, which captures the nested easy-to-hard subspace structure and the associated leap exponents.
Then they characterized the hardness of learning a multi-index model by proving that a sample complexity of $n \gtrsim d^{k_\star/2}$ is essential for the low-degree polynomial test to have non-trivial power in detecting the planted subspace.
For a matching upper bound, the authors propose to use a U-statistic that resembles the unfolding technique in multi-spike tensor PCA, and prove that $n \gtrsim d^{k_\star/2}$ samples suffice for this algorithm to learn the planted direction.
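As background, the unfolding technique referenced here can be sketched on a toy single-spike order-4 tensor PCA problem as follows (a generic illustration, not the paper's sequential U-statistic estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 30, 300.0                       # signal strength well above the unfolding threshold

# Planted spike u^{x4} plus iid Gaussian noise.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
spike = np.einsum('a,b,c,e->abce', u, u, u, u)
T = lam * spike + rng.standard_normal((d, d, d, d))

# Unfolding: reshape the order-4 tensor into a d x d^3 matrix and take its top
# left singular vector as the estimate of the planted direction.
M = T.reshape(d, -1)
u_hat = np.linalg.svd(M, full_matrices=False)[0][:, 0]
print(abs(u_hat @ u))                    # overlap close to 1
```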
Strengths and Weaknesses
Pros
- The paper is mathematically rigorous and well organized.
- The content is quite complete and novel, by fully settling the low-degree detection lower bound and a matching upper bound. Several examples are provided to help readers with understanding the notion of generative leap and the hardness of the multi-index models.
- The proof sketch for the upper bound is very clean, easy to follow and sufficient to cover the main idea.
Cons
- The layout, definitions and results seem to be a bit dense for readers.
Questions
- In the upper bound algorithm, you proposed a kernel-based algorithm that implements the label transformation of potentially infinite order, and the conditions on the kernel seem to be very weak. However, in the problem of learning single-index models, batch-reusing / explicit label transformations are deployed to go beyond CSQ queries [1,2], and certain conditions are required for general gradients to attain the optimal sample complexities [e.g. Assumption 4.1 (b), 2]. Does this kernel trick automatically select such label transformations? Or is there a way that we can use this method in iterative algorithms to achieve the optimal guarantee?
- Does the sub-space filtration (or flag, in (1)) from the generative leap exactly equal the filtration defined for the information leap? To put it another way, is a direction that is easier to learn in the sense of the generative exponent necessarily also easier to learn in the sense of the information exponent?
[1] Lee, Jason D., Kazusato Oko, Taiji Suzuki, and Denny Wu. "Neural network learns low-dimensional polynomials with sgd near the information-theoretic limit." Advances in Neural Information Processing Systems 37 (2024): 58716-58756.
[2] Chen, Siyu, Beining Wu, Miao Lu, Zhuoran Yang, and Tianhao Wang. "Can neural networks achieve optimal computational-statistical tradeoff? an analysis on single-index model." In The Thirteenth International Conference on Learning Representations. 2025.
Limitations
yes
Formatting Issues
No.
We sincerely thank the reviewer for their thoughtful feedback and address each point below.
In the upper bound algorithm, you proposed a kernel-based algorithm that implements the label transformation of potentially infinite order, and the conditions on the kernel seem to be very weak. However, in the problem of learning single-index models, batch-reusing / explicit label transformations are deployed to go beyond CSQ queries [1,2], and certain conditions are required for general gradients to attain the optimal sample complexities [e.g. Assumption 4.1 (b), 2]. Does this kernel trick automatically select such label transformations? Or is there a way that we can use this method in iterative algorithms to achieve the optimal guarantee?
Yes, this kernel trick automatically selects such label transformations. In fact, what this implicitly relies on is that it suffices to use a random label transformation. For example:
- uniformly sample $q \in [0,1]$ and then threshold the labels at the $q$-th quantile, i.e. $T_q(y) = 1$ if $F(y) \ge q$ and $T_q(y) = 0$ otherwise, where $F$ is the CDF of $y$ (min CDF kernel)
- transform $T_\omega(y) = \cos(\omega y + b)$ where $\omega \sim \mathcal{N}(0, \sigma^{-2})$ and $b \sim \mathrm{Unif}([0, 2\pi])$ (RBF kernel)
Either of these choices would succeed with constant probability. Our kernel estimator can be interpreted as "averaging" over the randomness in such a label transformation. Therefore, to use this method in an iterative algorithm (e.g. batch re-use) it suffices to show that the dynamics apply a sufficiently random label transformation to the labels. While this has been analyzed in the single-index setting (e.g. [1,2] above), we are not aware of any results that imply learnability of general multi-index functions using gradient descent given $n \asymp d^{k_\star/2}$ samples, where $k_\star$ is the generative leap.
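To make the "averaging over a random label transformation" picture concrete, here is a toy numpy sketch (illustrative notation, not the paper's estimator) checking that averaging the random threshold transformation over $q$ reproduces the min CDF kernel:

```python
import numpy as np

rng = np.random.default_rng(0)

# A few labels and their (empirical) CDF values F(y) in (0, 1).
y = rng.standard_normal(6)
F = (np.argsort(np.argsort(y)) + 1) / (len(y) + 1)

# Random transformation: draw q ~ Unif(0, 1) and set T_q(y) = 1{F(y) >= q}.
# Averaging T_q(y) * T_q(y') over q gives min(F(y), F(y')), the min CDF kernel.
qs = rng.uniform(size=200_000)
T = (F[:, None] >= qs[None, :]).astype(float)   # one column per random draw of q
K_mc = T @ T.T / len(qs)                        # Monte Carlo average over q
K_min = np.minimum(F[:, None], F[None, :])      # closed-form min CDF kernel

print(np.max(np.abs(K_mc - K_min)))             # small Monte Carlo error
```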
Does the sub-space filtration (or flag, in (1)) from the generative leap exactly equal the filtration defined for the information leap? To put it another way, is a direction that is easier to learn in the sense of the generative exponent necessarily also easier to learn in the sense of the information exponent?
No, in general these filtrations can be quite different. As a simple example, consider:
The filtration for the generative leap is since we can apply a label transformation to lower the exponent in the direction to (for example works here), and then condition on to learn . This corresponds to two leaps of size .
On the other hand, the filtration for the information leap is just directly since directly learning is leap 4 so the learner is forced to learn both coordinates from , which is a single leap of size .
Thanks for your in-depth comments! This settled my questions. I will keep my scores unchanged.
The goal of the paper is the analysis of learning in generic Gaussian multi-index models. The investigation generalises the notion of the generative exponent to the generative leap exponent, proving that this quantity governs the scaling of the sample complexity for learning. The paper collected unanimous positive feedback: its novelty and soundness have been appreciated by all Reviewers, in particular with respect to the introduction of the aforementioned generative leap exponent as an informative complexity measure. As considerable effort has been put by the theoretical community into the study of single-index and multi-index models, the paper provides a significant step forward in this line of research and might become a reference for further investigations of such prototypical setups. I therefore recommend its acceptance in NeurIPS, possibly as a spotlight.