PaperHub
Score: 6.8/10
Spotlight · 5 reviewers
Ratings: 4, 4, 3, 3, 4 (min 3, max 4, std 0.5)
ICML 2025

Learning with Exact Invariances in Polynomial Time

OpenReview | PDF
Submitted: 2025-01-23 · Updated: 2025-08-16
TL;DR

We propose a polynomial-time algorithm for learning exact invariances through kernel regression on manifold input spaces.

Abstract

Keywords
Learning with Invariances · Kernels · Spectral Theory

Reviews and Discussion

Official Review
Rating: 4

Building on results about Riemannian manifolds, group theory, and the spectral properties of the Laplace-Beltrami operator, the authors propose a learning algorithm that minimizes the population risk over a Sobolev space (functions that are s-times differentiable with s >= 2d, where d is the dimension of the input space, here a Riemannian manifold) under the constraint that the returned estimator is exactly invariant with respect to a (provided) group of transformations acting on the input space. Thanks to a nice observation on group invariance presented on page 5 (just before Definition 5.1), and to the fact that the size of a group generating set is bounded by the logarithm of the group size, the algorithm scales only logarithmically with the group size and polynomially with all other relevant parameters, such as d and the number n of training examples. Moreover, the authors argue that the Sobolev regression function can be well approximated by a superposition of at most D eigenfunctions (of the Laplace-Beltrami operator) for some reasonable D. As a consequence, the authors seem to claim that their algorithm "learns" (in polynomial time) a good approximation of the Sobolev regressor that achieves exact invariance.
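The observation the review refers to can be spelled out as follows; this is a paraphrase in generic notation, not a quotation from the paper:

```latex
% Invariance on a generating set implies invariance on the whole group.
% If S = \{\sigma_1,\dots,\sigma_m\} generates the finite group G, then
\begin{align*}
  f(\sigma_j \cdot x) = f(x) \quad \forall j \in [m],\ \forall x \in \mathcal{X}
  \;\;\Longrightarrow\;\;
  f(g \cdot x) = f(x) \quad \forall g \in G,
\end{align*}
% since every g \in G is a finite word in the generators. A finite group always
% admits a generating set of size m \le \log_2 |G| (cf. Appendix C1), so the
% number of invariance constraints drops from |G| to O(\log |G|).
```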

Questions for the Authors

What happens if G does not form a group? For example, if you translate an image, it will eventually fall off the boundary, and the inverse translation does not exist. This happens often. How can you address this problem?

Claims and Evidence

The claim that the proposed algorithm produces a predictor achieving exact invariance in time logarithmic in the size of the symmetry group is supported by two results: a nice property of groups (provided on page 5) and the fact that the size of a group generating set is bounded by the logarithm of the group size (Appendix C1).

Is the returned predictor f' (truncated to at most D eigenfunctions) a "good" approximation of the optimal Sobolev regressor f*? There are no equations that show precisely how close f' is to f*. The third equation on page 6 provides important information on this missing link, but not all of it. It seems to me that the learning objective should instead be to find an invariant f' that is as close as possible to the best invariant regressor f**, namely the one satisfying the optimization problem given by the last equation on page 6. Consequently, it would be nice if the authors could establish an upper bound on E_S || f' - f** ||^2_{L2} as a function of s, d, and n. I think this is what is actually missing in this paper. However, I think the contributions of this paper are already substantial enough to be presented at ICML.

Methods and Evaluation Criteria

This is a theoretical paper and the presented theory is nice and relevant, but it is very complicated to understand for those like me who have very little background in Riemannian geometry. Why do we need to treat the input space as a compact boundaryless Riemannian manifold? Why not just a compact subset of R^d? This would make your paper more accessible to the ICML audience. Do you need to consider the input space X as a Riemannian manifold because of the group action on X? I think the authors should provide the motivation for considering Riemannian manifolds.

Theoretical Claims

I was able to verify the claim on "reducing the number of constraints" on page 5 and the property proven in Appendix C1. I was able to follow Sections 3, 4, and 5, but it took me quite a while. However, many of the definitions, lemmas, and theorems in the appendix are just too advanced for me; I would need much more time to understand them.

Experimental Design and Analysis

The small experiment seems OK and supports the theoretical claims and results.

Supplementary Material

I have read B8, B9, and C1. Much of the supplementary material goes over my head!

Relation to Prior Work

The paper nicely positions itself in the literature on learning under invariance.

Essential References Not Discussed

You should provide a citation for the result of the expected population risk (in fact, the expected excess risk) of the KRR estimator on page 3.

Otherwise, I have not identified any other missing references.

Other Strengths and Weaknesses

The book of Schölkopf and Smola was published by MIT Press in 2002, not 2018.

Other Comments or Suggestions

There are a few conflicting notations. In the supplementary material, g is used to represent a group element and the metric tensor!

\eta is the regularizer of the KRR of Equation 8. But that regularizer is referred to as \lambda on page 8.

Author Response

We appreciate the reviewer’s recognition of our work’s merits and their valuable comments.

The claim that the proposed algorithm produces a predictor achieving exact invariance in time logarithmic in the size of the symmetry group is supported by two results: a nice property of groups (provided on page 5) and the fact that the size of a group generating set is bounded by the logarithm of the group size (Appendix C1). Is the returned predictor f' (truncated to at most D eigenfunctions) a "good" approximation of the optimal Sobolev regressor f*? There are no equations that show precisely how close f' is to f*. The third equation on page 6 provides important information on this missing link, but not all of it. It seems to me that the learning objective should instead be to find an invariant f' that is as close as possible to the best invariant regressor f**, namely the one satisfying the optimization problem given by the last equation on page 6. Consequently, it would be nice if the authors could establish an upper bound on E_S || f' - f** ||^2_{L2} as a function of s, d, and n. I think this is what is actually missing in this paper. However, I think the contributions of this paper are already substantial enough to be presented at ICML.

Answer: Thanks for your interesting question! In our problem setting, the function $f^\star$ is invariant, which implies that $f^{**} = f^\star$ (using the reviewer's notation; i.e., the 'best' invariant estimator is the optimal regressor $f^\star$). The distance between $\hat{f}$ and $f^\star$ is measured using $\mathcal{R}(\hat{f}) = \mathbb{E}_S[\lVert \hat{f} - f^\star \rVert^2]$, and this quantity is explicitly upper bounded as a function of $n$, $s$, and $d$ in Theorem 1 (second item, line 207).
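In symbols, the argument in this answer reads as follows (our restatement, using the reviewer's notation):

```latex
% Since the target regressor f^\star is itself G-invariant, the best invariant
% regressor coincides with it:
\begin{align*}
  f^{**} \;=\; \arg\min_{f\ G\text{-invariant}} \mathbb{E}\big[\lVert f - f^\star \rVert_{L^2}^2\big] \;=\; f^\star,
  \qquad\text{so}\qquad
  \mathbb{E}_S\big[\lVert \hat f - f^{**} \rVert_{L^2}^2\big] \;=\; \mathcal{R}(\hat f),
\end{align*}
% which is precisely the quantity bounded in terms of n, s, and d in Theorem 1.
```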

This is a theoretical paper and the presented theory is nice and relevant; but ... I think the authors should provide the motivation for considering Riemannian manifolds.

Answer: Thank you for your question and constructive suggestion. We considered boundaryless Riemannian manifolds in order to present a general theory that holds in a wide range of settings. Note that the boundaryless assumption is primarily for simplicity, to avoid technical complications related to boundary behavior when defining eigenfunctions. The theory can be extended to manifolds with boundaries under standard regularity assumptions on the boundary. However, we agree that compact subsets of Euclidean spaces are more familiar and accessible to the ICML community. We will revise the paper accordingly in the next version to reflect this suggestion.

You should provide a citation for the result of the expected population risk (in fact, the expected excess risk) of the KRR estimator on page 3.

Answer: Sure, thanks for your suggestion! The bound is derived in several standard references, such as [1], and we will add the appropriate citations to the paper.

The book of Schölkopf and Smola was published by MIT Press in 2002, not 2018.

Answer: Thanks for pointing this out! We will correct it in our next version!

There are a few conflicting notations. In the supplementary material, g is used to represent a group element and the metric tensor!

Answer: Thanks for pointing this out! Since we frequently use $g$ to denote group elements, we will revise the notation for metric tensors to avoid any confusion.

\eta is the regularizer of the KRR of Equation 8. But that regularizer is referred to as \lambda on page 8.

Answer: Thanks for pointing this out! Yes, in this case, using the notation $\lambda$ instead of $\eta$ is confusing, as $\lambda$ can also be read as an eigenvalue and thus be related to $D$, as discussed in Section 6.1. We will correct this to avoid any confusion.

What happens if G does not form a group? For example, if you translate an image, it will eventually fall off the boundary and the inverse translation does not exist. This happens often. How can you address this problem?

Answer: Thanks for raising this interesting question. If $G$ is not a group, it is still possible, under certain conditions, to obtain a sequence of linearly constrained quadratic programs for this problem. The main challenge in such cases is that the existence of a logarithmic-sized generating set is not guaranteed. To address this, one could consider alternative algebraic or analytic approaches to construct suitable "generating sets" for these settings. In our opinion, it may be possible to extend the results to such cases by imposing appropriate assumptions. We will include a discussion of this in the final version of the paper.

[1] Bach, Francis. Learning Theory from First Principles. MIT Press, 2024.

Official Review
Rating: 4

The paper addresses the challenge of learning with exact invariances (symmetries) in kernel regression. Traditional methods either fail to provide polynomial-time solutions or are not applicable in the kernel setting. The authors propose a polynomial-time algorithm that achieves exact invariances using oracle access to the geometric properties of the input space. The algorithm achieves the same excess population risk as the original kernel regression problem, making it the first polynomial-time algorithm to achieve exact invariances in this context.

Questions for the Authors

In Algorithm 1, we don't know $\alpha$ in practice. How does this affect the efficiency of the proposed algorithm?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

The proof appears to be correct, although I haven't verified it line by line.

Experimental Design and Analysis

No

Supplementary Material

No

Relation to Prior Work

The paper's key contributions advance several important areas of prior research, including learning with invariances, kernel methods, and optimization. By addressing the limitations of traditional methods and providing a polynomial-time algorithm for learning with exact invariances, the paper makes a significant theoretical and practical contribution to the field.

Essential References Not Discussed

No

Other Strengths and Weaknesses

Strengths:

  1. The paper provides a rigorous theoretical framework for learning with exact invariances, supported by tools from differential geometry and spectral theory.

  2. The proposed algorithm runs in polynomial time, making it practical for real-world applications where traditional methods are computationally prohibitive.

  3. The algorithm achieves the same generalization error as kernel regression without invariances, showing that exact invariances do not compromise statistical performance.

Weaknesses:

  1. The algorithm relies on oracle access to the geometric properties of the input space, which may not always be available in practical applications.

  2. The paper lacks extensive experimental validation, but the theoretical contribution is solid.

Other Comments or Suggestions

No

Author Response

We appreciate the reviewer for acknowledging the merits of our work and for raising interesting questions.

The algorithm relies on oracle access to the geometric properties of the input space, which may not always be available in practical applications.

Answer: Thank you for this interesting feedback. Yes, in general, computing even the first non-zero eigenvalue of the Laplacian can be a very challenging problem in the geometry of manifolds. However, in practical scenarios, we often deal with group actions encoded via matrices over algebraic spaces (such as polynomials), which allows us to access the oracle through polynomial evaluations or by computing products of polynomials (see lines 180–190, second column). Such settings include tori, Stiefel manifolds, spheres, and direct products of these spaces. We present the result for arbitrary manifolds to keep the paper's contributions broadly applicable, and to highlight that the difficulty of the problem on arbitrary manifolds arises from intrinsic geometric properties rather than from the learning task itself. We will discuss this further in the next version of the paper. Thus, the result applies to practical settings, and the oracle assumption is not restrictive in the problems we often encounter in practice.
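To make the torus setting mentioned above concrete, here is a minimal NumPy sketch of a spectrally truncated, exactly sign-invariant estimator on the torus. It is our own illustration of the general idea, not the paper's Algorithm 1 (Spec-AVG); the helper names, frequency cutoff, and target function are chosen purely for illustration.

```python
import numpy as np

# Toy illustration: regression on the d-torus with exact sign invariance
# G = {+1,-1}^d. On the torus, Laplace-Beltrami eigenfunctions are Fourier
# modes; those invariant under every coordinate sign flip are exactly the
# products of cosines, so exact invariance is enforced by restricting to these
# modes -- no enumeration over the 2^d group elements is needed.

rng = np.random.default_rng(0)
d, n, D = 3, 200, 20                      # input dim, samples, retained modes

# Low-frequency multi-indices k in {0,1,2}^d, truncated to D modes (crude cutoff).
freqs = np.array(list(np.ndindex(3, 3, 3)))[:D]

def make_features(X):
    # Invariant eigenfunctions: phi_k(x) = prod_i cos(k_i * x_i).
    return np.stack([np.prod(np.cos(X * k), axis=1) for k in freqs], axis=1)

X = rng.uniform(-np.pi, np.pi, size=(n, d))
f_star = lambda Z: np.cos(Z[:, 0]) * np.cos(2 * Z[:, 1])    # sign-invariant target
y = f_star(X) + 0.1 * rng.standard_normal(n)

Phi = make_features(X)
eta = 1e-3                                                  # ridge regularizer
w = np.linalg.solve(Phi.T @ Phi + eta * np.eye(D), Phi.T @ y)

# Exact invariance check: flipping input signs leaves predictions unchanged.
signs = rng.choice([-1.0, 1.0], size=(n, d))
print(np.max(np.abs(make_features(X) @ w - make_features(signs * X) @ w)))  # ~1e-16
```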

In Algorithm 1, we don't know alpha in practice. How does this affect the efficiency of the proposed algorithm?

Answer: Thanks for pointing this out. According to our proof, we only require a lower bound on $\alpha$ to run the algorithm, and the proof remains valid under such settings. Therefore, it is sufficient to know that the optimal function satisfies at least some degree of smoothness, as encoded by $\alpha$. We will incorporate this clarification into the next version of the paper.

Official Review
Rating: 3

The paper suggests that, in Kernel Ridge Regression (KRR) under certain assumptions, including on the kernel space, the exact invariance otherwise obtained by group averaging is achievable in polynomial time. This is done by using a finite number of basis functions derived from a constrained spectral method for the Laplace-Beltrami operator and by leveraging the fact that a small number of generators can represent a large number of group elements.

Questions for the Authors

None

Claims and Evidence

The claim presented as Theorem 1 is theoretically convincing. However, empirically there is limited evidence to confirm its efficiency.

Methods and Evaluation Criteria

Using a limited number of eigenfunctions and group elements is reasonable. We can choose the target operator based on prior knowledge of the data. It allows the kernel space derived from the operator to span all group elements using only the group's generators.

Theoretical Claims

I did not check the proofs explicitly, but overall I got an intuition about why achieving polynomial time is feasible.

Experimental Design and Analysis

The synthetic data experiments with sign-invariances on the torus manifold support the efficiency of training regarding the number of training samples. However, since the main theorem and the paper's title emphasize learning with invariances in polynomial time, the paper should empirically compare actual running times (e.g., wall-clock times) and the invariance error of the trained model to ordinary kernels, even if the results seem obvious. Additionally, at least one real dataset, such as from the UCI repository, should be included to show how many bases (D) are needed in practical scenarios to achieve reasonable performance and to show performance with larger groups like finite rotation groups. If a large number of bases is necessary in real cases, this method might not be beneficial compared to approximating group averaging by randomly sampling group elements, despite losing exact invariance.

Supplementary Material

The supplements include explanations of concepts, theorem proofs, and experimental results. I checked the experimental part and it is clearly explained.

Relation to Prior Work

The limitations of prior group averaging could be partially resolved by this paper in the KRR setting, although its efficiency on real datasets remains uncertain.

Essential References Not Discussed

None

Other Strengths and Weaknesses

Although the theoretical guarantees are strong and the method is quite insightful, as I mentioned in the experimental design and analysis section, the empirical evidence supporting its efficiency on other datasets or groups is limited.

Other Comments or Suggestions

The notations $D_\lambda$ and $D^\lambda$ are pretty confusing.

Author Response

Thanks a lot for your valuable feedback and constructive comments regarding additional experiments. While we are committed to including additional detailed experiments in the camera-ready version of the paper, we would like to emphasize that the main focus of this work is theoretical, and it lies within the domains of statistical learning theory and computational problems in statistics. We hope the reviewer will kindly reconsider their evaluation in light of this context.

the paper should empirically compare .... the invariance error of the trained model to ordinary kernels even if the results seem obvious.

The invariance error (referred to as Invariance Discrepancy in Section 6.3 of the manuscript) of the model trained using our method (Spec-AVG) is zero; this is why we did not include the invariance error of Spec-AVG in the plots. We will add a note in the caption of Figure 1 to make this clearer in the camera-ready version of the manuscript.

Plots of experiments showing the invariance error of ordinary Kernel Ridge Regression (KRR) are provided in Figure 1 of the appendix, serving as proof of concept that KRR is not inherently invariant. This experiment highlights the need for additional considerations beyond the KRR estimator to achieve exact invariant estimators, even in practice.

... empirically compare actual running times even if the results seem obvious ... at least one real dataset, such as from the UCI repository ...

Thanks a lot for your thoughtful suggestion. We will include additional experiments on the prohibitively high running time of existing algorithms for real datasets to further support our motivation in the camera-ready version of the manuscript. As you clearly noticed, the computational costs of existing algorithms that require enumeration over all group elements (such as group averaging and data augmentation) are extremely high. For example, in the case of permutations, the complexity is at least of order $\Omega(d!)$, which becomes infeasible even for relatively small values of $d$; e.g., when $d = 20$, the number of permutations is already $20! \approx 2.43 \times 10^{18}$, which is about 2 million times larger than the estimated 1.2 trillion parameters of GPT-4 [1] :D.
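The arithmetic in this example is easy to verify (a quick sanity check, not from the paper):

```python
import math

# Sanity check of the figures quoted above.
n_perms = math.factorial(20)            # 2432902008176640000, i.e. ~2.43e18
gpt4_params = 1.2e12                    # estimated GPT-4 parameter count cited above
print(f"{n_perms:.3e}")                 # 2.433e+18
print(f"{n_perms / gpt4_params:.2e}")   # ~2.03e+06, i.e. about 2 million times larger
```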

... If a large number of bases is necessary in real cases, this method might not be beneficial ...

Thanks a lot for your interesting question. Similar estimators that use a cutoff for the number of bases are very classical in statistics, especially in density estimation. Indeed, in many density estimation problems, this class of estimators is known to be minimax optimal and is often the only efficient choice.

It is important to note that the number of bases, $D$, required by our algorithm (Spec-AVG) is $n^{1/(1+\alpha)}$, where $\alpha \in (1, \infty)$. Please refer to Line 1 of Algorithm 1. Thus, $D$ is at most $\sqrt{n}$, which is relatively low and practical.
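As a numerical illustration of this formula (our own example values, not from the paper):

```latex
% Since \alpha \in (1,\infty), the exponent satisfies 1/(1+\alpha) < 1/2, hence
% D = n^{1/(1+\alpha)} \le \sqrt{n}. For instance,
\begin{align*}
  n = 10^{4},\ \alpha = 3 &\;\Longrightarrow\; D = (10^{4})^{1/4} = 10, \\
  n = 10^{4},\ \alpha \downarrow 1 &\;\Longrightarrow\; D \uparrow \sqrt{10^{4}} = 100.
\end{align*}
```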

In Figure 2 of the appendix, we provide plots of Spec-AVG with different choices of $D$, as well as KRR with varying regularization parameters $\eta$, to illustrate the effect of hyperparameters on population risk. We are committed to providing further experiments, including those on real-world datasets, for the camera-ready version of the manuscript to fully address your concern.

... approximating group averaging by randomly sampling group elements ...

The focus of this work is on the theoretical computational-statistical trade-offs in learning with exact invariances, and random sampling does not guarantee exact invariance; hence, it falls outside the scope of this work.

The notations $D^\lambda$ and $D_\lambda$ are pretty confusing.

Thank you for pointing this out! We agree with the reviewer that the notation is confusing, and we will revise it in the final version of the paper.

[1] https://medium.com/%40ceo_44783/why-i-prefer-gpt-4-over-gpt-4o-cf504741e156

Reviewer Comment

Thanks for the detailed comments. However, I still have some concerns.

  1. I understand the main focus of the paper is theory. However, to support your claim about efficient computation, it would be better to include an experiment on computational efficiency. As you said, the permutation group on $d$ elements has a huge number of group elements. In that case, you should reduce the group size and compare with group averaging. That's why we need toy datasets. In fact, your toy example already has a relatively small group size (sign invariance with $G=\{1,-1\}^{10}$), so $|G|=2^{10}=1024$. I believe you can compare with group averaging at that scale.

  2. For a top-tier conference paper, I believe a real dataset evaluation is necessary to support the value of the new theory. In particular, your theory suggests a new computationally efficient method, not a theoretical analysis of an existing algorithm or architecture. Not every theory works in practice, which is what we ultimately care about. If you want to focus only on theory, I would suggest submitting to journals or theory-focused conferences.

Author Comment

Thank you for the additional explanations, which helped clarify your concerns. We hope our three-fold response below addresses them.

  • This paper studies the computational complexity of learning with invariances. Previous results suggest that learning with exact invariances is likely computationally hard. For example, [1,2] showed that learning Boolean circuits with noise, which is permutation-invariant, is hard unless cryptographic assumptions are broken. [3] proved that learning a shift-invariant periodic neuron is computationally hard due to its cryptographic hardness. More recently, [4] showed exponential lower bounds for many problems under invariances, including basic versions of learning invariant polynomials. Given these results, one might conjecture that learning with exact invariances is not solvable in polynomial time. However, we propose a polynomial-time algorithm for learning with exact invariances using kernel regression!

  • Our algorithm initiates computational statistical trade-offs in learning with invariances, a key area in statistical learning theory and computation. Some works in this field have been published in related venues, including density estimation [5], tensor PCA [6], Ising models [7], supervised learning [8], sparse regression [9]. For further discussion, see the recent FODSI workshop [10].

  • Our goal is to show that exact invariance can be achieved with desirable generalization error in efficient time. This is important because practical invariant models are (1) exactly invariant, (2) efficient, and (3) generalize well. However, it was previously unknown whether an exact, efficient algorithm that also generalizes exists, even in classical settings like kernel regression. Our main contribution is demonstrating that such algorithms exist, placing the computational complexity within the polynomial time hierarchy. This theory supports the success of invariant architectures like GNNs and focuses on showing that the complexity lies in the polynomial hierarchy, rather than introducing a new competing algorithm.

We appreciate the reviewer's comment on the group size for the experiments. In our original response, we mentioned that experiments for group averaging are complex even in medium-scale settings. However, with the small scale suggested by the reviewer, we were able to run an experiment on a real dataset.

We consider the MNIST dataset and use the task presented in [11] to empirically study Deep Sets. In particular, we construct a set of $M = 6$ samples from MNIST (each $28 \times 28$ dimensional), and the task is to predict the sum of the numbers in the set, which is permutation-invariant. Thus, each datapoint is of dimension $6 \times 28 \times 28$. We produce 100 training and 100 test samples uniformly from the dataset.

For this task, we compare the following methods: (1) (linear) kernel regression with group averaging over $6! = 720$ permutations, and (2) the proposed method in the paper, which uses sparsity and spectral averaging. The results (run on CPUs) are as follows:

Method             RMSE      Runtime (s)
Group Averaging    7.5057    57.2249
Proposed           5.6464     2.5768

Here, the runtime of our proposed method is better by a factor of approximately 23, while it also achieves a better root mean-squared error (we conjecture that this is because spectral averaging performs smoother operations). Note that both methods here are based on kernel regression, and they cannot beat neural network architectures. We are currently trying to scale up this experiment, and we will include the large-scale results in the next version of the paper. We hope this addresses the reviewer's concern regarding experiments on real data showing time efficiency.
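For concreteness, a schematic sketch of the group-averaging baseline described above is given below. The actual MNIST pipeline is not specified in this response, so the sketch uses random features and a synthetic sum label purely to show the cost of enumerating all 6! permutations; every name and size in it is ours, not the authors'.

```python
import numpy as np
from itertools import permutations

# Schematic group-averaging baseline: linear kernel ridge regression on features
# averaged over all 6! = 720 permutations of the set elements. (For a linear
# kernel this averaging collapses to the mean over set elements; the point here
# is only the enumeration cost over |G|.)

rng = np.random.default_rng(0)
M, dim, n = 6, 784, 100                        # set size, per-element dim, samples
X = rng.standard_normal((n, M, dim))           # n sets of M elements
y = X[:, :, 0].sum(axis=1)                     # stand-in "sum over the set" labels

perms = list(permutations(range(M)))           # all 720 group elements

def averaged_feature(x):
    # Group averaging: mean of the flattened set over every row permutation.
    return np.mean([x[list(perm)].ravel() for perm in perms], axis=0)

Phi = np.stack([averaged_feature(x) for x in X])   # exactly permutation-invariant
eta = 1e-2
K = Phi @ Phi.T                                    # linear kernel on averaged features
alpha = np.linalg.solve(K + eta * np.eye(n), y)

x_test = X[0]
pred = Phi @ averaged_feature(x_test) @ alpha              # prediction for the set
pred_perm = Phi @ averaged_feature(x_test[::-1]) @ alpha   # same set, reordered
print(pred, pred_perm)   # equal up to float rounding: exact invariance
```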

  1. Pietrzak, K. "Cryptography from learning parity with noise." ICTP 2012.
  2. Blum, A., Kalai, A., Wasserman, H. "Noise-tolerant learning and the parity problem." J. ACM, 2003.
  3. Song, M.J., Zadik, I., Bruna, J. "Cryptographic hardness of learning single periodic neurons." NeurIPS 2021.
  4. Kiani, B., Le, T., Lawrence, H., Jegelka, S., Weber, M. "Hardness of learning under symmetries." ICLR 2024.
  5. Aamand, A., Andoni, A., Chen, J., Indyk, P., Narayanan, S., Silwal, S., Xu, H. "Statistical-computational trade-offs for density estimation." NeurIPS 2024.
  6. Dudeja, R., Hsu, D. "Trade-offs in tensor PCA via communication complexity." Annals of Statistics, 2024.
  7. Jin, Y., Wang, Z., Lu, J. "Trade-offs in inferring Ising model structures." ICML 2020.
  8. Yi, X., Wang, Z., Yang, Z., Caramanis, C., Liu, H. "More supervision, less computation: trade-offs in weakly supervised learning." NeurIPS 2016.
  9. Arpino, G., Venkataramanan, R. "Trade-offs in mixed sparse linear regression." COLT 2023.
  10. Schramm, T., Trevisan, L. "Computational Complexity of Statistical Inference."
  11. Zaheer, M., et al. "Deep sets." NeurIPS 2017.
Official Review
Rating: 3

This paper investigates the statistical-computational trade-offs involved in learning with invariances, particularly in the context of kernel regression. While the Kernel Ridge Regression (KRR) estimator can be applied to this problem, it lacks invariance unless combined with group averaging, which is computationally expensive for large groups. This raises the question of whether statistically sound estimators with efficient time complexity can be developed. The authors demonstrate that by reformulating the problem and reducing the number of constraints using group laws, the task can be expressed as solving an infinite series of quadratic optimization programs subject to linear constraints. The paper presents a polynomial-time algorithm that achieves an exactly invariant estimator.

Questions for the Authors

Could the analysis of the main theorem be extended to cases where the regression function does not belong to the hypothesis space? Could Algorithm 1 be combined with other large-scale optimization methods, such as kernel conjugate gradient methods with random projections or divide-and-conquer kernel ridge regression?

Claims and Evidence

The claims made in the submission appear to be supported by a rigorous theoretical framework.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria appear to be well-suited for the problem and application at hand.

Theoretical Claims

I did not check the proofs.

Experimental Design and Analysis

It seems well-founded.

Supplementary Material

I did not review the supplementary.

Relation to Prior Work

Kernel methods, including Kernel Ridge Regression (KRR), are well-established in machine learning for their ability to model nonlinear relationships. However, standard KRR does not inherently handle invariances. The paper leverages the theoretical foundations of KRR but reformulates the problem to incorporate invariances directly.

Essential References Not Discussed

None

Other Strengths and Weaknesses

Strengths: The paper is clearly written and accessible, making it easy to follow. It introduces a novel algorithm with two key strengths: computational efficiency and the ability to enforce group invariances. The theoretical analysis appears thorough and well-developed, providing a solid foundation for the proposed method.

Weaknesses: Conducting more numerical experiments could help demonstrate the proposed algorithm's improvements in computational efficiency and its ability to enforce group invariances while maintaining the same learning rate as the original KRR. Additionally, it would be beneficial to compare the proposed algorithm with other state-of-the-art methods, such as GNNs. However, as a theoretical paper, this should be acceptable in this regard.

Other Comments or Suggestions

None

Author Response

Thank you very much for your positive and constructive feedback. We’ve done our best to address your concerns in detail below, and we hope this will support a more favorable evaluation and score of our work.

Conducting more numerical experiments could help demonstrate the proposed algorithm's improvements in computational efficiency and its ability to enforce group invariances while maintaining the same learning rate as the original KRR. Additionally, it would be beneficial to compare the proposed algorithm with other state-of-the-art methods, such as GNNs. However, as a theoretical paper, this should be acceptable in this regard.

Answer: Thanks for your suggestion! While we are committed to adding additional detailed experiments in the camera-ready version of the manuscript, as noted by the reviewer, the main focus of this work is on the computational complexity of learning and is primarily theoretical in nature.

Could the analysis of the main theorem be extended to cases where the regression function does not belong to the hypothesis space? Could Algorithm 1 be combined with other large-scale optimization methods, such as kernel conjugate gradient methods with random projections or divide-and-conquer kernel ridge regression?

Answer: Thanks for your interesting and thoughtful question. The analysis could be extended to cases where the regression function is not in the hypothesis space, under certain conditions. For instance, if the regression function does not belong to the Sobolev space of order $\alpha$ (the space in which we perform KRR), but instead belongs to another Sobolev space of order $\beta$, then it is possible to extend the theory and obtain generalization bounds. However, for arbitrary functions, it is not clear how to identify an appropriate RKHS that would allow the proof to be adapted to such settings. In our opinion, addressing this challenge may require problem-specific approaches.

Regarding computational efficiency, large-scale KRR methods are indeed useful for alleviating the computational complexity of kernel methods, which typically involve matrix inversions and require $O(n^3)$ time. These methods effectively reduce one polynomial-time algorithm to another with lower polynomial-time complexity. In contrast, our goal is to propose a polynomial-time solution for a problem whose current solutions have (super)exponential time complexity. In our current setting, it is not clear how to further reduce the complexity using large-scale KRR techniques, and we defer adapting the algorithm to such settings to future work.

Official Review
Rating: 4

This work addresses the problem of learning exactly invariant models in the kernel regression setting. Given an assumption on the smoothness of the target function, they propose an algorithm for computing an invariant estimator in polynomial time with respect to the number of samples and polylogarithmic with respect to the cardinality of the group, which is significantly more efficient than other common techniques in invariant learning such as frame averaging and data augmentations. Specifically, the authors reformulate the problem as an optimization in the spectral domain and showcase how they can reduce it into an equivalent problem of optimizing a finite number of linear constrained quadratic programs. While the main text focuses on an extensive derivation of the proposed algorithms, the authors also describe the application of the algorithm in some simple examples, which facilitate the understanding of the method.

Questions for the Authors

I don't have any significant questions for changing the paper's evaluation.

Claims and Evidence

All of the claims of the paper, mainly theoretical, are supported by comprehensive proofs.

Methods and Evaluation Criteria

While the evaluation of the proposed method is limited to a simple problem, it is sufficient to showcase some quantitative differences between the proposed algorithm and the non-invariant Kernel Ridge Regression (KRR) that acts as a baseline.

Theoretical Claims

I reviewed the theoretical derivation in the main text and the supplementary material. I didn't detect any issues.

Experimental Design and Analysis

The paper only provides a simple experimental evaluation, which, as stated above, focuses on a single comparison with KRR. A possible interesting addition would be to provide comparisons with other invariant methods such as group averaging, frame averaging, and data augmentations. While this addition is not necessary since the focus of this work is more theoretical, it will better showcase the disadvantages of the alternative methods in an experimental setting.

Supplementary Material

I went over the proofs in the supplementary material.

Relation to Prior Work

This work contributes to the field of invariant machine learning and, specifically, invariant kernel regression. As stated in the introduction and related work, there is a large literature on invariant kernel regression that can suffer from higher computational complexity, especially in the case of larger symmetry groups. In this context, the authors' contribution is the introduction of a polynomial algorithm for performing kernel regression that respects known invariances.

Essential References Not Discussed

I didn't find any significant works that were not referred to in the paper.

Other Strengths and Weaknesses

Strengths:

  • A major strength of the proposed algorithm is that while it achieves polynomial time complexity, it doesn't sacrifice the generalization error, achieving the same performance as non-invariant learning methods.
  • The authors provide a comprehensive presentation of the theoretical tools that they then utilize to analyze the proposed algorithm's computational complexity and generalization performance.

Weaknesses:

  • One possible limitation of the proposed method is the focus on discrete symmetry groups. This is not exactly a weakness, but it can limit the setting in which the proposed algorithm can be utilized.
  • Similarly, the limitation of the proposed algorithm in kernel regression, while allowing for extensive theoretical analysis, can limit the application of the proposed method.

Other Comments or Suggestions

No other comments or suggestions

Author Response

Thanks a lot for your positive and constructive feedback.

The paper only provides a simple experimental evaluation, which, as stated above, focuses on a single comparison with KRR. A possible interesting addition would be to provide comparisons with other invariant methods such as group averaging, frame averaging, and data augmentations. While this addition is not necessary since the focus of this work is more theoretical, it will better showcase the disadvantages of the alternative methods in an experimental setting.

Answer: Thanks for your suggestion! While we are committed to adding additional detailed experiments in the camera-ready version of the manuscript, as noted by the reviewer, the main focus of this work is on the computational complexity of learning and is primarily theoretical in nature.

One possible limitation of the proposed method is the focus on discrete symmetry groups. This is not exactly a weakness, but it can limit the setting in which the proposed algorithm can be utilized.

Answer: Thanks for pointing this out! Extending the results to infinite groups is an interesting and relatively challenging direction for future work. We appreciate the suggestion!

Final Decision

This submission is centered around machine learning with (exact) invariances. In particular, the authors show (Theorem 1) that it is possible to learn a G-invariant estimator (G being a finite group) in polynomial time with respect to the dimension (of the smooth manifold forming the domain of the true Sobolev regression function), the sample size, and the size of a minimum generating set of G (which is at most log(|G|) and can be significantly smaller than the cardinality of G), while achieving the same excess risk guarantee as kernel ridge regression (KRR; which might not satisfy the invariance constraints). The approach provides an alternative to existing methods like data augmentation and group averaging, with an exponentially improved dependence on |G|. A numerical example is given to illustrate the applicability of the construction, providing a comparison with KRR.

Kernel methods are at the forefront of principled data science; similarly, learning with invariances is a key question in the area. The authors propose a novel method at the intersection of these domains in a clearly written manuscript that can be of definite interest to the ICML community from both theoretical and practical perspectives.