Learning with Exact Invariances in Polynomial Time
A polynomial-time algorithm for learning with exact invariances by kernel regression over manifold input spaces.
Abstract
Reviews and Discussion
This paper studies the statistical-computational trade-offs of learning with exact invariances using kernel regression over manifolds. It proposes the first polynomial-time algorithm for this problem and proves a generalization error bound for it, using tools from differential geometry, spectral theory, and optimization.
Strengths
Learning with invariance (or other structural properties of the learning problem) is important for both theoretical studies and algorithm design in machine learning. This paper proposes the first computationally efficient algorithm for kernel learning over manifolds, and introduces new theoretical tools from differential geometry and spectral theory, which I think might be of independent interest.
Weaknesses
The main weakness of this paper is that the generalization bound is the same as that for learning without invariance, failing to demonstrate the benefit introduced by invariance. Although this has been mentioned as future work, it would be more interesting if the current algorithm already satisfied this property. In contrast to the bound in [1], the upper bound in this paper not only has a larger problem-dependent constant but also a larger dependence on the norm of the target function.
References
[1] Behrooz Tahmasebi and Stefanie Jegelka. The Exact Sample Complexity Gain from Invariances for Kernel Regression. NeurIPS, 2023.
Questions
(1) In line 150, should the expectation operator be corrected? I think the expectation is w.r.t. the distribution over the manifold, not the training data.
(2) The current analysis of the generalization error uses only standard techniques and does not make full use of the invariance. Is the reason that the current analysis is not tight for the proposed algorithm, or that the proposed algorithm does not fully leverage the invariance?
We express our gratitude to the reviewer for their valuable comments and feedback. In the above, we have offered a general response to the reviewers, addressing some of their common concerns about experiments. Here, we provide additional responses to the remaining questions raised.
The main weakness of this paper is that the generalization bound is the same as that for learning without invariance, failing to demonstrate the benefit introduced by invariance. Although this has been mentioned as future work, it would be more interesting if the current algorithm already satisfied this property. In contrast to the bound in [1], the upper bound in this paper not only has a larger problem-dependent constant but also a larger dependence on the norm of the target function.
Thank you very much for mentioning this point. Indeed, it is currently unknown whether the minimax optimal bound of [1] can be achieved by an estimator with polynomial-time computation. We note that, despite the differences in constant terms between our bounds and those in [1], the orders are the same. Furthermore, our bounds match the sample complexity of kernel regression without invariances.
To conclude, we reiterate that the kernel estimator using group averaging [1] is minimax optimal but requires potentially super-exponential time complexity.
In line 150, should the expectation operator be corrected? I think the expectation is w.r.t. the distribution over the manifold, not the training data.
Thank you for your comment. As you correctly pointed out, we intended to indicate that the expectation is taken over uniformly random points generated from the manifold, not the empirical measure. To align with standard notation and incorporate your suggestion, we have updated our notation in the revised version. Furthermore, we have clarified the definition of the expectation involved in the population risk on Line 150 (highlighted in red) to ensure clarity.
The current analysis of the generalization error uses only standard techniques and does not make full use of the invariance. Is the reason that the current analysis is not tight for the proposed algorithm, or that the proposed algorithm does not fully leverage the invariance?
Thank you for asking this question. We note that the kernel estimator using group averaging [1] is minimax optimal but requires potentially super-exponential time complexity. In contrast, we trade off this statistical optimality by introducing a new algorithm that achieves polynomial time complexity while remaining statistically desirable (though not minimax optimal).
To address the reviewer's concern regarding the full utilization of invariance, we emphasize that our algorithm requires setting a cut-off frequency for the estimation. The minimax optimal bound can be achieved with a suitable (larger) choice of this cut-off, but this compromises the time efficiency of the proposed algorithm. To fully leverage invariance while maintaining polynomial time efficiency, one must choose the cut-off so that it is at most polynomial in the number of samples and the dimension, subject to non-asymptotic bounds on the low-dimensional spectral sparsity of the group action on the manifold. However, these bounds are not well understood in the applied mathematics literature for arbitrary manifolds, making the problem of optimizing the cut-off under these constraints both unclear and difficult to address. As a result, we set the cut-off to a predefined (potentially suboptimal) value that achieves the desired properties of our algorithm. Identifying the optimal cut-off, as well as the general optimal statistical-computational trade-off, is left for future work.
Finally, we note that, by construction, our algorithm outputs an exact invariant estimator.
[1] Behrooz Tahmasebi and Stefanie Jegelka. The Exact Sample Complexity Gain from Invariances for Kernel Regression. NeurIPS, 2023.
Thanks for the detailed responses. The explanation of the hardness of this problem has addressed most of my concerns. I disagree that the order of the bound in this paper is the same as that in [1], because the relevant quantities are defined differently in the two bounds.
I maintain the score and recommend accepting this paper based on its contributions to both algorithm and theory.
[1] Behrooz Tahmasebi and Stefanie Jegelka. The Exact Sample Complexity Gain from Invariances for Kernel Regression. NeurIPS, 2023.
Thank you very much for your reply. We are wondering if there are any additional concerns we should address that might lead to a more positive evaluation of our work in your view.
The authors consider the setting of invariant regression when the regression function lies in a Sobolev space. It is well known that when the smoothness exceeds half the dimension, the Sobolev space is an RKHS, and for regression without the invariance constraint the kernel ridge estimator is minimax optimal. As with any kernel method, a forward pass through such an estimator requires, at worst, evaluating the kernel at every training point. However, the ridge estimator may not be invariant under the group of interest. In this paper, the authors provide a new estimator that achieves the same error rate as the usual ridge estimator, is invariant under the group of interest, and can be implemented in polynomial time. Note that this estimator may not be minimax optimal, as the true regression function with the required constraint lies in a smaller class.
Strengths
- The paper is well-written and is easy to follow.
- The estimator is pretty natural. While the proof of Theorem 1 is relatively straightforward, I find the idea of re-representing the invariance constraints as constraints on the coefficients of a basis expansion clever.
Weaknesses
- I did not find any apparent weakness. The paper delivers on its promise of developing an invariant estimator for Sobolev regression.
- One might argue that the lack of experiments takes away from the paper, but I am fine with a theory paper with sound and rigorous proof not including experiments. The paper's conceptual contribution is, in my view, sufficient to address this limitation.
Questions
In Section 3, the authors mention that group averaging can be too costly for large groups. But what about group averaging only over the generator? Does the kernel ridge regression with the group averaging over the generator set yield an invariant estimator? If so, what do we know about its generalization error?
We sincerely appreciate the reviewer's thoughtful and constructive comments. In our general response, we address some common concerns regarding the experiments. Here, we provide additional responses to the remaining specific concerns.
I did not find any apparent weakness. The paper delivers on its promise of developing an invariant estimator for Sobolev regression.
Thank you very much for your positive feedback on the results!
One might argue that the lack of experiments takes away from the paper, but I am fine with a theory paper with sound and rigorous proof not including experiments. The paper’s conceptual contribution is, in my view, sufficient to address this limitation.
Thank you for your comment. Given the reviewers' suggestions to evaluate the proposed algorithm through experiments, we decided to include experiments in the revised version of the paper. Please refer to our general response for details on the experiments.
In Section 3, the authors mention that group averaging can be too costly for large groups. But what about group averaging only over the generator? Does the kernel ridge regression with the group averaging over the generator set yield an invariant estimator? If so, what do we know about its generalization error?
Thanks for your valuable comment. We note that averaging over a generating set does not yield an invariant estimator. For example, consider the linear kernel k(x, y) = x^T y together with the permutation group S_d acting by permuting the coordinates of vectors. A transposition and a d-cycle are two permutations that generate S_d, yet averaging the kernel over these two permutations yields a kernel of the form k'(x, y) = x^T A y, where A is the average of the two corresponding permutation matrices.
Clearly, the resulting kernel is not permutation-invariant and, therefore, does not necessarily produce invariant estimators (it is worth noting that it is not even a PSD kernel). Consequently, generating sets are not useful for simplifying the group averaging process involved in constructing invariant kernels.
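For concreteness, here is a minimal numerical sketch of this point (not the paper's code; the dimension d = 3, the specific generators, and the test points are our own illustrative choices):

```python
import numpy as np

# Hypothetical illustration: averaging a linear kernel over a generating set
# of S_3 (a transposition and a 3-cycle) does not give an invariant kernel.
d = 3

def perm_matrix(p):
    # permutation matrix P with (P x)_i = x_{p(i)}
    P = np.zeros((d, d))
    P[np.arange(d), list(p)] = 1.0
    return P

generators = [perm_matrix((1, 0, 2)),   # transposition of coordinates 0 and 1
              perm_matrix((1, 2, 0))]   # 3-cycle
A = sum(generators) / len(generators)   # averaged kernel k'(x, y) = x^T A y

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
P = perm_matrix((2, 0, 1))              # a group element outside the generating set

print(x @ A @ y, (P @ x) @ A @ (P @ y))   # 5.25 vs 3.0: not invariant
print(np.linalg.eigvalsh((A + A.T) / 2))  # a negative eigenvalue: not even PSD
```

Averaging over the full group (all six permutations) instead yields the all-ones matrix scaled by 1/3, which does give a permutation-invariant and PSD kernel.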
Once again, we thank the reviewer for raising this question, and we hope this example addresses their concern.
I want to thank the authors for answering my question. I will maintain my score.
Thank you very much for your reply. We are wondering if there are any additional concerns we should address that might lead to a more positive evaluation of our work in your view.
The authors show that it is possible to learn with invariance on compact smooth manifolds with a finite group action in a manner that simultaneously enjoys polynomial sample complexity and computational complexity in the problem parameters.
Strengths
The paper is well-written and flows well. The exposition is very clear. The authors also highlighted the novelty and the gap that is filled. The proof sketch is nice, and I find that it conveys the essential technical aspects.
Weaknesses
I think the first bullet point on Line 085 could be made clearer: as I understand it, the main results of this paper show one point on the statistical-computational trade-off curve. I don't see how to interpret the results as showing how to trade off statistical efficiency to gain computational efficiency. I might be misunderstanding this bullet point, however.
Questions
Is the case of (positive-dimensional) Lie group actions addressable using this approach?
Theorem 1 should make it clear that the resulting estimator is invariant under the group.
Line 234: how about the second oracle (the one about computing inner product between shifted eigenfunctions). Is that also efficiently computable?
Are there other examples besides the sphere for which the oracle calls can be computed efficiently? E.g., if the authors can talk about an example with Stiefel manifolds, that would be nice to know about.
We sincerely appreciate the reviewer's recognition of the significance of the contributions made by our work, as well as the constructive comments provided. Please let us know if there are any other concerns that need to be addressed.
I think the first bullet point on Line 085 could be made clearer: as I understand it, the main results of this paper show one point on the statistical-computational trade-off curve. I don't see how to interpret the results as showing how to trade off statistical efficiency to gain computational efficiency. I might be misunderstanding this bullet point, however.
Thank you for your valuable comment! Here, we trade off computational complexity against statistical complexity using a parameter that represents the number of eigenspaces utilized in the estimator (equivalently, the number of convex quadratic programs solved). A larger value of this parameter improves statistical efficiency but increases computational cost.
For the primary goal of this paper, i.e., achieving an efficient invariant estimator in polynomial time, the choice of this parameter described in the paper is sufficient. Please refer to Remark 8 in the manuscript for a short discussion.
We have added a highlighted line (Line 91) to further clarify this point.
Is the case of (positive-dimensional) Lie group actions addressable using this approach?
Thank you for your question. In this version of the paper, our primary focus is on finite groups, where group averaging can be computed in finite (though potentially prohibitively large) time. We have left the extension to positive-dimensional Lie groups as a direction for future work.
It is important to note that for infinite groups, proposing an algorithm that computes group averaging in finite time is not immediately feasible (and currently unknown) due to the infinite number of group transformations involved in the averaging process.
Theorem 1 should make it clear that the resulting estimator is invariant under the group.
Thanks for mentioning this! We have made the necessary changes in Line 249 of the revised version (highlighted in red).
Line 234: how about the second oracle (the one about computing inner product between shifted eigenfunctions). Is that also efficiently computable?
Yes, for the case of (harmonic) polynomials, as long as they have bounded degree (Line 233), one can compute their inner product using polynomial multiplication algorithms (even naive algorithms run in polynomial time for this problem, though faster ones, such as those based on the Fast Fourier Transform, exist in the literature). Therefore, both oracles can be answered in polynomial time. We have also applied these clarifications to the revised version (Line 235, highlighted in red).
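As a purely illustrative sketch (our own toy example, not the paper's implementation), both oracles are especially transparent on the circle S^1, where the eigenfunctions are exp(i k theta) and a finite rotation group acts by phase shifts:

```python
import numpy as np

# Toy sketch on S^1: eigenfunctions phi_k(theta) = exp(i k theta), and a finite
# cyclic group acting by rotation theta -> theta + alpha. Both oracle calls
# (pointwise evaluation and inner products of shifted eigenfunctions) are cheap.
K = 5                                   # cut-off frequency (bounded degree)
alpha = 2 * np.pi / 3                   # a rotation from the cyclic group Z_3

def phi(k, theta):
    return np.exp(1j * k * theta)       # oracle 1: evaluate an eigenfunction

thetas = np.linspace(0.0, 2 * np.pi, 2048, endpoint=False)

def inner(k, m):
    # oracle 2: <g . phi_k, phi_m> on S^1, here by simple quadrature; the
    # closed form is exp(i k alpha) if k == m and 0 otherwise.
    return np.mean(phi(k, thetas + alpha) * np.conj(phi(m, thetas)))

for k in range(-K, K + 1):
    for m in range(-K, K + 1):
        closed_form = np.exp(1j * k * alpha) if k == m else 0.0
        assert np.isclose(inner(k, m), closed_form, atol=1e-10)
```

On higher-dimensional spheres, the same computation reduces to multiplying and integrating bounded-degree (harmonic) polynomials, as described above.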
Are there other examples besides the sphere for which the oracle calls can be computed efficiently? E.g., if the authors can talk about an example with Stiefel manifolds, that would be nice to know about.
Thanks for mentioning this interesting example! Yes, the oracle calls can also be computed over tori, for instance, which we did in the newly added experimental section. For the Stiefel manifold, we note that its eigenfunctions form a linear subspace of (harmonic) polynomials. Therefore, the discussion provided for spheres also applies to the Stiefel manifold, as its eigenfunctions can similarly be expressed as low-degree polynomials.
Thank you very much for your constructive review and positive feedback. We are wondering if there are any additional concerns we should address that might lead to a more positive evaluation of our work in your view.
This paper proposed a computationally efficient algorithm for learning with exact invariances over RKHS. Specifically, on one hand, the authors demonstrated that the proposed algorithm only has polynomial-time complexity, in sharp contrast to the previous algorithm with invariances. On the other hand, an advantage compared to KRR is that the proposed algorithm can enforce group invariances without loss in statistical convergence rate.
Strengths
- The paper is well-written and easy to follow.
- A novel algorithm is proposed with two main advantages: computational efficiency and the enforcement of group invariances.
- The theoretical analysis in this paper is comprehensive and rigorous. The statistical rate of the proposed algorithm is provided, along with detailed comparisons to the related algorithm.
Weaknesses
Writing:
The dimension first appears before it is defined. I suggest the authors introduce its definition earlier for clarity.
In Line 144, the labeled samples are drawn from the product of the manifold and the label space, rather than from the manifold alone.
Numerical experiments:
While I find the theoretical aspect interesting and well-founded, it may be beneficial to conduct simple numerical experiments to support the improvement of the proposed algorithm in computational efficiency and enforcing group invariances, while preserving the same learning rate as the original KRR.
Questions
One concern I have is about the computational complexity comparison between the proposed algorithm and the original one. You claim that your algorithm reduces the time cost from super-exponential in the dimension to polynomial time. However, to my knowledge, the standard KRR method can only solve low-dimensional tasks, where the dimension is typically small and can be neglected compared to the number of sample points.
I also note that the previous algorithm has a certain computational complexity (if I can ignore the group-size factor), while your algorithm lowers it to the complexity reported in Theorem 1. I suggest that the authors carefully discuss this issue.
We express our gratitude to the reviewer for their valuable comments and feedback. In the above, we have offered a general response to the reviewers, addressing some of their common concerns about experiments. Here, we provide additional responses to the remaining questions raised.
Writing: The dimension first appears before it is defined. I suggest the authors introduce its definition earlier for clarity.
Thank you for your comment. To address it, we have added a sentence clarifying the role of the dimension in the revised version (highlighted in red, Line 133).
In Line 144, the labeled samples are drawn from the product of the manifold and the label space, rather than from the manifold alone.
The data points are sampled from the manifold, while the labels are real-valued. Consequently, the labeled tuples are drawn from the product of the manifold and the label space (Line 144).
Numerical experiments: While I find the theoretical aspect interesting and well-founded, it may be beneficial to conduct simple numerical experiments to support the improvement of the proposed algorithm in computational efficiency and enforcing group invariances, while preserving the same learning rate as the original KRR.
Please refer to our general response for the experiments.
One concern I have is about the computational complexity comparison between the proposed algorithm and the original one. You claim that your algorithm reduces the time cost from super-exponential in the dimension to polynomial time. However, to my knowledge, the standard KRR method can only solve low-dimensional tasks, where the dimension is typically small and can be neglected compared to the number of sample points.
Thanks for your valuable comment! We answer your question from two different perspectives, computational and statistical.
From a computational viewpoint, in the original problem of KRR (without invariances), there are two terms contributing to the computational complexity. The dominating term is O(n^3), given that the evaluation of the Gram matrix can be done efficiently. This complexity term arises because, once all the entries of the Gram matrix for all pairs of data points are calculated, they only need to be plugged into the closed-form solution of the KRR estimator, which involves an n x n matrix inversion requiring O(n^3) computations.
The other term is the computational complexity of calculating the Gram matrix itself. In practice, for most popular kernels, this calculation is computationally efficient. For example, for dot-product and distance-based kernels (e.g., Gaussian or RBF kernels, polynomial kernels, and Matérn kernels for Sobolev spaces), one only needs to evaluate the inner product or distance between two d-dimensional vectors, which requires only O(d) time. Therefore, the time complexity required to calculate the Gram matrix is O(n^2 d). Thus, this term is usually dominated by the O(n^3) term whenever n is at least d.
Putting these pieces together, as long as the kernel can be evaluated in time polynomial in d, one can compute the KRR estimator in polynomial time with respect to the number of samples n and the input space dimension d. However, in learning with exact invariances, one additional factor comes into play: the size of the group |G|, which can be super-exponential in the input dimension (for example, |G| = d! in the case of permutation invariances). For instance, if we use group averaging, then for each pair of data points we need to compute the corresponding entry of the invariant Gram matrix by averaging the kernel over all group elements. This step has a complexity of order n^2 |G|. Compared to the case of KRR without invariances, this term dominates as long as |G| exceeds n, which is usually the case (the number of observations does not grow beyond a super-exponential dependence on the input dimension d). Additionally, we note that, from the perspective of complexity theory, a complexity of order n^2 |G| is not appealing, as it is not polynomial in the parameters of the problem statement, i.e., n and d.
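To make the extra |G| factor concrete, here is a minimal sketch (our own illustration, not from the paper: a linear kernel, the permutation group on d = 6 coordinates, and synthetic data):

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)
n, d = 200, 6
X = rng.standard_normal((n, d))
y = np.sin(X).sum(axis=1) + 0.1 * rng.standard_normal(n)   # a symmetric target

# KRR without invariances: O(n^2 d) for the Gram matrix, O(n^3) for the solve.
K_plain = X @ X.T

# Group averaging: every Gram entry is averaged over all |G| = d! = 720
# permutations, i.e., an extra |G| factor on top of the n^2 entries.
G = list(permutations(range(d)))
K_avg = np.zeros((n, n))
for g in G:
    K_avg += X[:, list(g)] @ X.T      # k(g . x_i, x_j) for all pairs at once
K_avg /= len(G)

lam = 1e-2
coef = np.linalg.solve(K_avg + lam * np.eye(n), y)          # O(n^3) solve
y_hat = K_avg @ coef                                        # fitted values
```

Even for this small d, each Gram entry already requires 720 kernel evaluations; with d = 20, the factor |G| = 20! is astronomically large.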
Second, from a statistical viewpoint, the non-asymptotic convergence guarantees for the KRR problem depend on the number of observations n and on the spectral properties of the chosen kernel, e.g., the spectral decay rate. In many cases, these parameters do not have a direct dependence on the dimension d of the input space. In many other settings, choosing a number of observations n that is polynomially large in d is sufficient for small generalization errors. For example, in the case of polynomial regression of bounded degree, choosing n polynomially large in d leads to small estimation errors. Consequently, the computational/statistical efficiency of KRR estimators with small errors, especially in the case of learning with exact invariances, remains an interesting and valid problem in terms of both parameters n and d.
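As a small concrete check of the polynomial-regression example above (our own illustration, fixing the degree at q = 3): the number of coefficients to estimate is C(d + q, q), which grows only polynomially in d for fixed q, so a polynomially large sample already matches the number of parameters.

```python
from math import comb

# Number of monomials of total degree <= q in d variables: C(d + q, q).
q = 3
for d in (5, 10, 50, 100):
    print(d, comb(d + q, q))   # grows like d^3 for fixed q = 3
```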
Once again, we thank the reviewer for raising this important question, and we hope the discussion above addresses their concern.
I also note that the previous algorithm has a certain computational complexity (if I can ignore the group-size factor), while your algorithm lowers it to the complexity reported in Theorem 1. I suggest that the authors carefully discuss this issue.
Thank you for pointing out this question. To clarify, we note that the computational complexity that we report in Theorem 1 (main result) is conditional on the oracle calls that we use to construct the estimator. More precisely, we compute the estimator in polynomial time using polynomially many oracle calls.
Calculating the KRR estimator (without invariances) requires on the order of n^2 calls to an oracle that evaluates the kernel at pairs of data points. Consequently, the two algorithms rely on different types of oracles and are not directly comparable in terms of total time complexity without considering the specific algorithm used to process the oracle calls.
In our proposed framework, the oracles involve evaluations of eigenfunctions and inner products of eigenfunctions. For most cases, these oracles can be answered using efficient algorithms. For example, in hyperspheres the eigenfunctions correspond to bounded-degree (harmonic) polynomials (Line 232, highlighted in red). Therefore, oracle calls can be answered in polynomial time (more efficient algorithms, such as those based on the Fast Fourier Transform (FFT), exist for computing polynomial products but for the purposes of this paper, even a naive approach suffices to ensure that oracle calls are answered in polynomial time).
Thank you for your response. I would keep my score and lean to accept this paper.
Thank you very much for your reply. We are wondering if there are any additional concerns we should address that might lead to a more positive evaluation of our work in your view.
We appreciate the reviewers for their constructive feedback and for recognizing the technical merits and contributions of our work. During the rebuttal phase, we have revised the manuscript and highlighted the changes in red.
Below, we provide a general response to the common question about experiments raised by the reviewers.
Experiments: The reviewers requested the implementation of the algorithms discussed in the paper. In the revised version, we have included an experimental section to validate the practical performance of the proposed algorithm (Spec-Avg) and to compare it with Kernel Ridge Regression (KRR). The new results are presented in the updated manuscript (Appendix C, highlighted in red).
We now summarize the experimental results and refer the reviewers to the paper for further details. In our setup, we consider regression over d-dimensional tori with a sign-invariant target function (i.e., a function invariant with respect to the group of sign transformations acting by coordinate-wise sign inversion).
We report the performance of our proposed algorithm and KRR for i.i.d. samples of varying size under different regularization parameters (for our algorithm and for KRR, respectively). Our main finding is that the proposed algorithm learns the target function with a desirable test loss, comparable to the KRR estimator. Moreover, we also report an invariance discrepancy (defined in the paper, Appendix C), which measures the closeness of any estimator to invariance. For our algorithm, this value is zero, as the estimator is invariant by construction. We observe that KRR, in contrast, does not necessarily produce invariant estimates in practice.
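For reference, the following minimal sketch (our own simplified version; the target function and the exact discrepancy definition in Appendix C may differ) illustrates the sign-invariance check on the torus:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_test = 3, 500
X = rng.uniform(-np.pi, np.pi, size=(n_test, d))     # test points on the d-torus

def f_target(X):
    # sign-invariant target: even in every coordinate
    return np.cos(X).sum(axis=1) + np.cos(2 * X).prod(axis=1)

# The sign group {+1, -1}^d acting by coordinate-wise sign inversion.
signs = np.array(np.meshgrid(*([[-1, 1]] * d))).T.reshape(-1, d)

def invariance_discrepancy(f, X):
    base = f(X)
    return max(np.mean(np.abs(f(X * s) - base)) for s in signs)

print(invariance_discrepancy(f_target, X))            # ~0: invariant by construction
```

For a non-invariant estimator, the same quantity is strictly positive, which is the behavior we observe for KRR.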
This paper aims to learn with exact invariances using kernel regression over manifold input spaces. Under the assumption that the geometric properties of the input space can be accessed, the paper proposed a new algorithm that learns a classifier with exact invariances in polynomial time, while retaining the excess population risk as the original kernel regression.
The paper is well written, and the contributions are clear. Although the generalization error is not improved by the invariance, the minimax analysis provided in the rebuttal alleviates the concern. My major concern, however, is the applicability of the proposed method in general machine learning. I fully understand that the paper's major contributions are theoretical. However, the two assumptions on the oracle appear rather demanding in real applications - what can be guaranteed when the Laplace–Beltrami operator is approximated by the graph Laplacian? The examples given in the paper are rather simplistic (unit sphere), making it quite difficult to see how the method can have a significant impact on the real practice of learning with exact invariance.
Additional Comments from Reviewer Discussion
The rebuttal has been noted by the reviewers and has been taken into account by the AC in the recommendation of acceptance/rejection.
We are surprised by this decision!
Our work was placed in the learning theory area, so we expected it to be evaluated accordingly. As the title suggests, we focus on the computational complexity of learning with exact invariances. By "exact," we mean the underlying manifold is known, so no approximation is needed. Therefore, the Area Chair’s comment on “approximating the Laplace–Beltrami operator via the graph Laplacian” is less relevant—exact invariances are impossible if the manifold is only approximated.
Our "oracle" assumption is general and naturally applies to classic manifolds such as spheres, tori, and Stiefel manifolds. We also provided concrete experiments showing the practicality of our algorithm. Additionally, harmonic analysis, a well-established field of mathematics, primarily studies functions on spheres. This makes the claim that our examples (e.g., the unit sphere) are "simplistic" seem unwarranted.
We hope this clarifies any misunderstandings.
Reject