PaperHub
Overall rating: 4.8 / 10
Decision: Rejected · 4 reviewers
Reviewer ratings: 3, 5, 6, 5 (min 3, max 6, std 1.1)
Average confidence: 3.0
ICLR 2024

Nonparametric Classification on Low Dimensional Manifolds using Overparameterized Convolutional Residual Networks

OpenReview · PDF
Submitted: 2023-09-20 · Updated: 2024-02-11
TL;DR

We prove that overparameterized ResNeXts achieve a close-to-minimax rate in learning Besov functions on a low-dimensional manifold when trained with weight decay.

Abstract

Keywords
Nonparametric Classification; Low Dimensional Manifolds; Overparameterized ResNets; Function Approximation

Reviews and Discussion

Review
Rating: 3

The authors provide a nonparametric analysis of ConvResNeXts---a generalisation of deep convolutional residual networks. In particular, the authors derive approximation and estimation bounds when the target function is in the Besov class, that is the class of functions supported on a low-dimensional differential manifold and having controlled smoothness. In particular, the estimation result shows that ConvResNeXts achieves a rate of error convergence close to the minimax rate.

Strengths

The paper makes a good combination of several ideas on neural network construction and generalization estimation. The central results are clear and easy to locate. Most claims are well-supported and section 4 is extremely useful in understanding the proofs.

Weaknesses

The lack of a strong comparison with previous works makes it difficult to appreciate the impact of the result. The adaptivity to the data distribution is a well-known property of fully connected networks of any depth (even in the kernel regime), which makes me doubt the necessity of developing a theory of ConvResNeXts. What insights do we get from the analysis that requires using ConvResNeXts?

Secondly, I found the definition of Besov spaces too dense and, hence, difficult to understand. Perhaps a few examples would help, as well as a more detailed account of how such spaces generalise Sobolev and Holder spaces, which are known to a wider audience.

Questions

  1. The brief definition of ConvResNeXts at the end of the second paragraph does not suffice to understand the architecture.

  2. Adding a quick definition of Besov spaces when they are first mentioned in the third paragraph would improve readability. It is not reasonable that a reader unfamiliar with these spaces should wait until the conclusions to read a comment on their nature. In addition, the paragraph explaining Besov spaces in the conclusions would benefit from further clarification, e.g. some way to understand why the Holder and Sobolev spaces are contained in Besov spaces, and concrete support for vague claims such as 'Besov spaces can capture important features such as edges'.

Comment

Dear reviewer,

Thank you for taking the time to review the ICLR paper. I would like to address the question you've raised with the following response.

Weakness 1: The lack of a strong comparison with previous works makes it difficult to appreciate the impact of the result. The adaptivity to the data distribution is a well-known property of fully connected networks of any depth (even in the kernel regime), which makes me doubt the necessity of developing a theory of ConvResNeXts. What insights do we get from the analysis that requires using ConvResNeXts?

We are not sure whether the reviewer is referring to the analysis based on the neural tangent kernel. Chen and Xu (2020, arXiv 2009.10683) show that the neural tangent kernel of deep feedforward networks essentially corresponds to a function class induced by the Laplacian kernel. Although the theoretical analysis relies on the eigendecomposition of the corresponding reproducing kernel Hilbert space, which depends on the underlying data distribution, the obtained statistical rates of convergence are not necessarily adaptive to the intrinsic dimension of the Riemannian manifolds considered in this paper, and therefore still suffer from the curse of dimensionality. For example, Bach (2014, arXiv 1412.8690) only proves adaptivity to a linear subspace, not to more general manifolds. Moreover, to estimate Besov functions optimally, it is essential that the method use a different kernel "bandwidth" parameter for each local neighborhood. This is something that DNNs in the NTK regime are not capable of (more on this in our Answer 2!).

Existing results on neural networks, such as Chen et al. (2021), do show that feedforward networks can adapt to the low dimensional manifolds. However, their results are not applicable to overparameterized neural networks considered in this paper.

A recent paper (Zhang and Wang, 2023) highlights that weight decay applied to overparameterized DNNs with parallel blocks can be optimal for Besov functions due to the weight-decay-induced sparsity, but it was left as an open problem whether the parallel blocks were crucial. They also did not consider adapting to the manifold. The architectural insight we prove in this paper is that parallel blocks are not strictly necessary when residual connections are present, and that one can trade the number of parallel paths M against the number of residual blocks N as long as their product MN remains the same. Put differently, while we did not show that the convolutional layers are essential, we did provide new insight into why the "residual connections" and "parallel blocks" in ResNeXts are useful for both approximation and generalization.

Weakness 2: I found the definition of Besov spaces too dense and, hence, difficult to understand. Perhaps a few examples would help, as well as a more detailed account of how such spaces generalize Sobolev and Holder spaces, which are known to a wider audience.

Defining the Besov space is essential for illustrating the merits of overparameterized NNs over kernels. Kernel ridge regression, with any choice of kernel, was proven to be suboptimal for estimating functions in the Besov class by Donoho, Liu, and MacGibbon (1990). This includes NTKs. As for the Sobolev and Holder classes, they are smaller function classes that do not include functions with heterogeneous smoothness. We have included a new Appendix A, which illustrates concretely what "functions with heterogeneous smoothness" means and clarifies the relationship between the Holder, Sobolev, and Besov classes. We hope this addresses your concern.
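To make "heterogeneous smoothness" concrete, here is an illustrative one-dimensional example of our own (a standard textbook-style construction, not a formula taken from the paper or its appendix):

```latex
% Illustration only: a function that is smooth except for a single kink.
\[
  f(x) \;=\; \sin(2\pi x) \;+\; \bigl|x - \tfrac{1}{2}\bigr|, \qquad x \in [0,1].
\]
% Holder/Sobolev smoothness is a worst-case (uniform) notion, so the kink at
% x = 1/2 caps the Holder exponent of f at 1. Besov norms with p < infinity
% average local smoothness in L^p, so the isolated kink costs much less:
% roughly, |x - 1/2| retains Besov smoothness up to about 1 + 1/p. This is the
% sense in which Besov classes contain functions with spatially heterogeneous
% smoothness (e.g., edges) that Holder and Sobolev classes exclude.
```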

Q1: The brief definition of ConvResNeXts at the end of the second paragraph does not suffice to understand the architecture.

A1: We introduce the architecture of ConvResNeXts in Section 2.3; see also Figure 1(b) for an illustration. In the updated paper, we have added a pointer to Section 2.3 and Figure 1(b) in the introduction to make the description of the ConvResNeXt architecture more accessible.

Q2: Adding a quick definition of Besov spaces when they are first mentioned in the third paragraph would improve readability. In addition, the paragraph explaining Besov spaces in the conclusions would benefit from further clarification.

A2: We have added a quick introduction to Besov spaces in the third paragraph of the introduction to improve readability in the updated version. Thanks for the suggestion! Moreover, due to the space limit, we have added a detailed comparison among the different function spaces in the appendix. Roughly speaking, the Besov space contains functions with heterogeneous smoothness, while the Holder and Sobolev classes contain only functions with homogeneous smoothness. ConvResNeXts can achieve a near-optimal convergence rate for estimating functions in Besov classes, which is not achievable by kernel methods. This separation does not exist for smaller function classes such as the Sobolev and Holder classes, because their members are more homogeneously smooth.

Review
Rating: 5

The paper studies the capacity of convolutional residual networks to approximate and estimate smooth (Besov) functions on a smooth manifold. The paper focuses on the ConvResNeXt architecture, a convolutional network with residual connections and parallel modules. The paper shows that these networks can approximate arbitrary smooth target functions supported on a smooth manifold without suffering from the curse of dimensionality, i.e., with no exponential dependence on the extrinsic dimension of the problem. The work also provides an estimation result, giving a generalization bound for the ConvResNeXt architecture trained on smooth functions on manifolds.

Strengths

The paper provides novel generalization and approximation results on residual convolutional networks fitting smooth functions on manifolds.

Weaknesses

While both the approximation and generalization results seem novel, it is not clear to me whether the derived bounds are interesting in the context of learning with convolutional networks. In particular, it seems that the approximation results hold just as well for standard feedforward networks, and the main contribution of this work is transferring these bounds from densely connected networks to convolutional networks. The work relies on a result stating that any feedforward network can be reformulated as a convolutional network. Given this result, any property that holds for feedforward networks holds to some extent for conv-nets. The property of approximating smooth functions on a manifold seems to be just a particular case where this argument applies. Conv-nets are typically used in cases where they display significant benefits over densely connected feedforward networks, and I believe this should be reflected in the theoretical results. I think the authors should clearly state the following:

  1. Which of the results in the paper also apply to feedforward networks?
  2. For the results that apply to feedforward networks, which of them are novel to this work? E.g., has the approximation bound for smooth functions on a manifold already been established for feedforward networks?
  3. In which results is there an actual benefit to using convolutional networks compared to dense feedforward networks?

Another, more minor issue is with the presentation of the approximation and generalization bounds. If I understand correctly, some of the constants in the bounds have an exponential dependence on the intrinsic dimension of the manifold, and it would be interesting to show this dependence explicitly.

Minor:

  • Definition 6: there seems to be a typo, with some unclosed bracket.
  • Bottom of page 5: "hyperparameters parameters" should be just "hyperparameters"?
  • Top of page 7: "our This"

Questions

See above.

Comment

Dear reviewer,

Thank you for providing a thoughtful review of our paper. We appreciate your feedback and would like to address the points you have raised in a rebuttal.

Q1: Which of the results in the paper also apply to feedforward networks?

I believe there might be a misunderstanding here. Our work focuses on ConvResNeXts, which include the popular ConvResNet as a special case. Our techniques also work for (non-convolutional) ResNets and ResNeXts. However, our results are not directly applicable to feedforward networks without residual connections.

Q2: For the results that apply to feedforward networks, which of them are novel to this work? E.g., has the approximation bound on smooth functions on a manifold already been established for feedforward networks?

Similar results have been established in Chen et al. (2019) for feedforward networks, but without overparameterization or adaptivity to the Besov class. An intermediate result from our theory does apply to overparameterized feedforward networks with residual connections. The decomposition from feedforward networks to convolutional networks enables us to further investigate the depth and width tradeoff in the ConvResNeXt architecture.

The novelty of our theorems is that we consider a more complex model structure, ResNeXt, where there are N residual blocks and each of them has M parallel paths with identity connections. In comparison, existing work only considers densely connected feedforward neural networks. Moreover, our network can be overparameterized by choosing a large M, a large N, or both. This stands in stark contrast to the constraints in Zhang and Wang (2022), where M must be large and N = 1, and in Liu et al. (2021), where M = 1 and N must be properly upper bounded. Even though our network is highly overparameterized, we prove that it can efficiently learn the ground-truth function when trained with weight decay. However, classical statistical theories cannot explain this phenomenon due to the huge variance of the neural network class.
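For readers who want a concrete picture of the N-blocks-by-M-paths structure described above, here is a minimal PyTorch-style sketch of our own. It is only an illustration of the rebuttal's verbal description, not the authors' implementation; the 1-D convolutions, kernel size, channel widths, per-path depth, linear head, and SGD hyperparameters are all placeholder choices.

```python
# Minimal sketch (not the authors' code): a ConvResNeXt-style network with
# N residual blocks, each adding the sum of M parallel convolutional paths
# onto an identity (skip) connection.
import torch
import torch.nn as nn


class ParallelPath(nn.Module):
    """One path inside a residual block: a small two-layer 1-D conv net."""
    def __init__(self, channels: int, hidden: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.net = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size, padding=pad),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, kernel_size, padding=pad),
        )

    def forward(self, x):
        return self.net(x)


class ConvResNeXtSketch(nn.Module):
    """N residual blocks; each block maps x to x + sum of its M parallel paths."""
    def __init__(self, channels: int, n_blocks: int, m_paths: int, hidden: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.ModuleList([ParallelPath(channels, hidden) for _ in range(m_paths)])
            for _ in range(n_blocks)
        ])
        self.head = nn.Linear(channels, 1)  # scalar output for regression/classification

    def forward(self, x):  # x: (batch, channels, length)
        for paths in self.blocks:
            x = x + sum(path(x) for path in paths)  # identity + M parallel paths
        return self.head(x.mean(dim=-1))  # global average pooling, then linear head


# Overparameterize by taking N (blocks), M (paths), or both large; weight decay
# is the only explicit regularization, matching the regime discussed above.
model = ConvResNeXtSketch(channels=4, n_blocks=6, m_paths=8)
opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)
out = model(torch.randn(2, 4, 16))
print(out.shape)  # torch.Size([2, 1])
```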

In addition, compared to Zhang and Wang (2022), our paper considers the problem where the ground-truth Besov function is supported on a low-dimensional manifold. We develop a close-to-optimal convergence rate that depends only on the intrinsic dimension of the manifold, thus circumventing the curse of the ambient dimensionality.

Q3: In which results is there an actual benefit for using convolutional networks compared to feedforward dense networks?

Convolutional networks (convnets) have long exhibited advantages over feedforward networks, particularly in vision tasks. Our study focuses on nonparametric regression and classification with data on a smooth manifold, an abstraction of vision problems where images reside on a "natural image" manifold. Our results show that ConvResNet, (feedforward) ResNet, and architectures with or without parallel blocks as in ResNeXt all work, and one can trade width for depth as long as the product MN remains sufficiently large. As for whether a convnet is better or worse than a feedforward net, it really depends on which manifold and what kind of functions on the manifold we are estimating.

One important property that helps us prove the convergence rate is sparsity, which is induced by the block-wise architecture of a ResNeXt. Feedforward neural networks do not have such an architecture and typically are not sparse after training. Because of that, to the best of our knowledge, it remains completely open whether feedforward neural networks without cardinality constraints have the same or better sample efficiency than ResNeXts in learning functions in Besov spaces, not to mention Besov functions on a manifold.

Considering the popularity of ConvResNets, the theoretical study of ConvResNets and ConvResNeXts is quite limited. We are not aware of any existing results on the approximation and statistical theories for such practical networks. Our work attempts to bridge this gap by establishing a close-to-optimal statistical rate of convergence. Though we are not able to show the benefits of ConvResNeXts over feedforward networks, our work should still be considered a substantial contribution, as our setting is close to real applications in several aspects: practical architectures with overparameterization, training with weight decay, and data exhibiting low-dimensional structure.

Comment

Q4: Some of the constants in the bounds have exponential dependence on the intrinsic dimension of the manifold, and it is interesting to show this dependence explicitly.

As per the convention in the nonparametric regression literature, our results focus on characterizing the dependence on the sample size n. The constant may indeed depend exponentially on d. This naturally emerges from the volume of the input space. We do not introduce extra dependence beyond those in the classical work on nonparametric regression; see, e.g., Sections 3.2 and 4.4 of Györfi et al.'s book, "A Distribution-Free Theory of Nonparametric Regression".
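As a rough sketch of where such constants come from (our own note, standard in covering-number arguments): covering a bounded set of intrinsic dimension d at resolution r requires a number of balls exponential in d, and this count enters the bound only as a multiplicative constant.

```latex
% Volume heuristic: the r-covering number of a bounded d-dimensional set scales as
\[
  N(r) \;\asymp\; \left(\frac{C}{r}\right)^{d},
\]
% so constants of the form C^d multiply the rate, while the exponent of the
% sample size n depends only on the intrinsic dimension d, not the ambient one.
```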

Review
Rating: 6

This paper studies weight-decay-regularized training of overparameterized convolutional neural networks through the lens of nonparametric classification. In particular, the authors consider ConvResNeXts, which cover ConvResNets, and show that such training induces sparsity. This explains why overparameterized neural networks generalize. The authors then prove that the estimation error of ConvResNeXts for functions supported on a smooth manifold depends on the intrinsic rather than the ambient dimension. Thus, the curse of dimensionality does not occur.

Strengths

  • The paper is well-written, and the problem addressed is important to the community.
  • I like the fact that the authors study overparameterization of ConvResNets, although not directly.

Weaknesses

  • Since Theorem 4 depends on finding the global optimizer, I wonder if this is too strong of an assumption.

Questions

See Weaknesses.

Comment

Dear reviewer,

Thank you for taking the time to review the ICLR paper and for your positive feedback. I'm grateful for your engagement and would like to address the question you've raised with the following response.

Weakness: Since Theorem 4 depends on finding the global optimizer, I wonder if this is too strong of an assumption.

We concede that some existing work studying the effect of overparameterization does analyze the optimization dynamics and prove statistical learning bounds for the solutions that optimization algorithms converge to, e.g., those in the NTK regime. We value these works, as they explicitly unroll the optimization dynamics! On the other hand, none of these techniques is, so far, applicable to our problem. Our Appendix A illustrates why it is provably impossible for NTK to be optimal for estimating functions in Besov classes. So, in short, we gained something quite valuable by giving up on analyzing the optimization algorithm end to end and instead analyzing the regularized ERM. This approach of decoupling learning and optimization has its origin in the pioneering work of Vapnik, Bartlett, and many other learning theorists. We essentially make the same choice in the interest of obtaining a more fine-grained (and, in our opinion, the correct) learning theory for overparameterization. We hope the reviewer can see the complementary nature of our work and how it pushes along a different direction in the shared goal of understanding deep learning. Finally, we have added empirical studies illustrating that, while finding globally optimal solutions may not be tractable, our theory does match the outcome of training ConvResNeXts with SGD (see Appendix B) in the mildly overparameterized regime with representation learning --- something that NTK theory falls short on.

Thanks again for your appreciation, and I trust that this response adequately addresses your query. Should you have any further feedback or questions, please don't hesitate to share them. Your perspective is invaluable to us.

Review
Rating: 5

This paper develops the approximation and estimation theory of convolutional residual neural networks. It shows that ConvResNeXt networks can adapt well to smooth functions on a low-dimensional manifold, both in approximation and in generalization. The authors provide theoretical justification for overparameterizing such networks as well as for training them with weight decay.

Strengths

  • The authors perform strong, rigorous analysis and develop interesting theoretical guarantees for ConvResNeXt, a well-known network structure that has enjoyed remarkable performance on real applications. This paper is a great contribution to the understanding of such network structure.
  • The ideas of proof are novel and interesting to me.

Weaknesses

  • The paper does not provide a direct comparison with prior research on different NN architectures and function spaces. It's not clear to me how this work differs from prior ones, and what the theoretical advantages of your network structure are.
  • The paper is restricted to binary classification with empirical logistic loss. Are your findings extendable to other losses?
  • The paper is not written in clear language. Some sentences are unfinished, for example the 4th line on page 7.
  • No numerical experiments are conducted to support the theoretical findings.

Questions

  • For remark 1, it's not clear to me why there is only a small number of non-trivial blocks. It is also unclear to me why weight decay plays a crucial role here.
  • Why is there no curse of dimensionality in your problem setting? Is this because of the ConvResNeXt structure, the Besov function space or other assumptions?
  • The authors claim that ConvResNeXts can 'efficiently learn the function without suffering from the curse of dimensionality'. What does 'efficiency' here mean, sample efficiency or computational efficiency?
  • The authors claim that ConvResNexts learn Besov functions with a better convergence rate close to the minimax rate, which is significantly faster than NTK methods. What is the intuition behind this discovery?

Ethics Concerns

No ethics concerns.

Comment

Dear reviewer,

Thank you for your valuable evaluation of our manuscript. We provide responses to your questions and comments in the following.

Weakness 1: The paper does not provide a direct comparison with prior research on different NN architectures and function spaces. It's not clear to me how this work differs from prior ones, and what are the theoretical advantages of your network structure.

Note that we have already compared our results with related works in Section 1. Here we summarize the difference again as follows:

(1) Neural network architecture: In contrast to prior works, exemplified by Zhang and Wang (2022), which predominantly focus on parallel feedforward neural networks, our investigation centers on stacked ConvResNeXts. This architectural choice introduces a significantly more complex nested function form, presenting us with the challenge of addressing novel issues in bounding the metric entropy --- this is one of our major contributions. Specifically, we adopt a more advanced method that leverages Dudley's chaining of the metric entropy (via the critical radius / local Gaussian and Rademacher complexities; Bartlett et al., 2005; see the entropy-integral sketch after the reference below), which yields a tighter bound even for the much simpler setting of Zhang and Wang (2022).

Reference:

P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.
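For reference, the chaining bound alluded to in point (1) is the standard Dudley entropy-integral form stated below (a textbook statement, up to universal constants, not a bound specific to this paper); localizing it around the critical radius is the refinement of Bartlett et al. (2005) used in the response.

```latex
% Dudley's entropy integral (up to universal constants): for a function class F
% with covering numbers N(u, F, ||.||_n) in the empirical L2 metric,
\[
  \mathcal{R}_n(\mathcal{F})
  \;\lesssim\;
  \inf_{\delta \ge 0}\left\{ \delta
  + \frac{1}{\sqrt{n}} \int_{\delta}^{\infty}
  \sqrt{\log N\bigl(u, \mathcal{F}, \|\cdot\|_{n}\bigr)}\, \mathrm{d}u \right\}.
\]
```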

(2) Overparameterization regime: Our network architecture comprises N residual blocks, each containing M paths. Consequently, the network can be overparameterized by manipulating either a large M, a large N, or both. This stands in stark contrast to the constraints in Zhang and Wang (2022), where M must be large and N = 1, and in Liu et al. (2021), where M = 1 and N must be properly upper bounded. Even though our network is highly overparameterized, we prove that it can efficiently learn the ground-truth function when trained with weight decay. However, classical statistical theories cannot explain this phenomenon due to the huge variance of the neural network class.

(3) Manifold setting: Compared to Zhang and Wang (2022), our paper considers the case when the ground-truth Besov function is supported on a low-dimensional manifold and we develop a close-to-optimal convergence rate which only depends on the intrinsic dimension of the manifold, thus circumventing the curse of the ambient dimensionality.

| Work | Neural Architecture | Overparameterized | Low-dimensional manifolds | Function Class | Remark |
|---|---|---|---|---|---|
| Zhang and Wang (2022) | Parallel FNNs | Y | N | Besov | Not used in practice |
| Liu et al. (2021) | ConvResNets | N | Y | Besov | Difficult to train without overparameterization |
| Chen et al. (2019) | FNNs | N | Y | Holder | Difficult to train due to the cardinality constraints; not used in practice |
| This work | ConvResNeXts | Y | Y | Besov | Includes ConvResNets as a special case |

The advantages of ConvResNeXts include a flexible way to design overparameterized networks in which either the number of residual blocks or the number of paths in each block can be large. In addition, the skip-layer connections in ConvResNeXts effectively mitigate vanishing and exploding gradients. Notably, ConvResNeXts prove more straightforward to train in practice than simpler structures such as FNNs, resulting in outstanding performance across various real-world applications, particularly image recognition.

Weakness 2: The paper is restricted to binary classification with empirical logistic loss. Are your findings extendable to other losses?

Our findings generalize to any loss function that is Lipschitz continuous with respect to the model's output, including but not limited to the cross-entropy loss. While our paper predominantly relies on the logistic loss for the sake of clarity and simplicity, the applicability of our results extends to this broader range of loss functions.
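To spell out the Lipschitz condition for the logistic loss (a standard calculation, not taken verbatim from the paper): with labels y in {-1, +1}, the loss has derivative bounded by 1 in the model output, so it is 1-Lipschitz, and any loss with a uniformly bounded derivative in f satisfies the same condition.

```latex
% Logistic loss as a function of the model output f = f(x), with y in {-1, +1}:
\[
  \ell(y, f) \;=\; \log\!\bigl(1 + e^{-y f}\bigr),
  \qquad
  \left|\frac{\partial \ell}{\partial f}\right|
  \;=\; \frac{e^{-y f}}{1 + e^{-y f}} \;\le\; 1,
\]
% hence |l(y, f) - l(y, f')| <= |f - f'| for all f, f'.
```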

Weakness 3: Typo.

Thanks for pointing out the typo. We have corrected it in the latest updated version.

Weakness 4: No numerical experiments are conducted to support the theoretical findings.

We conducted a numerical simulation in which we learn a regression function supported on a low-dimensional space, and show that ConvResNeXts outperform other methods with relatively small model complexity. The results of ConvResNeXts are almost dimension-independent, thanks to representation learning that helps identify the low-dimensional manifold, while standard nonparametric methods, such as kernel ridge regression and Gaussian processes, deteriorate quickly as the ambient dimension grows. We have added the simulation to Appendix B in the updated paper.
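As a purely hypothetical sketch of this kind of setup (our own toy construction, not the simulation in Appendix B): draw a low-dimensional latent variable, embed it smoothly into a high-dimensional ambient space, and let the regression target depend only on the latent coordinates. All dimensions, the embedding map, and the target function below are placeholders.

```python
# Hypothetical data-generating sketch (not the authors' Appendix B setup):
# latent z lives in d dimensions, inputs x are a smooth embedding of z into a
# D-dimensional ambient space, and the target depends on z alone.
import numpy as np

rng = np.random.default_rng(0)
n, d, D = 500, 3, 100                              # sample size, intrinsic dim, ambient dim

z = rng.uniform(-1.0, 1.0, size=(n, d))            # latent coordinates on the manifold
A = rng.standard_normal((d, D)) / np.sqrt(d)       # fixed smooth embedding map
x = np.tanh(z @ A)                                 # D-dimensional inputs on a d-dim manifold
y = np.sin(3 * z[:, 0]) + np.abs(z[:, 1]) + 0.1 * rng.standard_normal(n)  # heterogeneous target

print(x.shape, y.shape)  # (500, 100) (500,)
```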

Comment

Q1: For remark 1, it's not clear to me why there is only a small number of non-trivial blocks. It is also unclear to me why weight decay plays a crucial role here.

A1: As shown in the proof of Lemma 6, "small" building blocks, whose weight norms are bounded by $\epsilon$, contribute little to the output. Consequently, the complexity of the network class hinges on the number of "large" blocks, whose weight norms exceed $\epsilon$. This observation leads to the conclusion that the covering number of the ConvResNeXt class is governed by at most $B_{\mathrm{res}}/\epsilon$ non-trivial blocks, where $B_{\mathrm{res}}$ denotes the upper bound on the sum of norms across all building blocks.

As mentioned in the first paragraph on Page 6, the norm constraints of the weights can be alternatively expressed through weight decay regularization. Therefore, a ConvResNeXt trained with weight decay ensures that the norms of its weights remain bounded. This, coupled with the insights derived from Lemma 6, elucidates how weight decay can induce sparsity in ConvResNeXt architectures.
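The counting step behind the sparsity claim in A1 can be written in one line (our paraphrase of the argument above, using the same notation):

```latex
% Weight decay keeps the total norm of the building blocks bounded by B_res,
% so only a few blocks can have norm larger than epsilon:
\[
  \sum_{m} \|W_m\| \;\le\; B_{\mathrm{res}}
  \quad\Longrightarrow\quad
  \#\bigl\{\, m : \|W_m\| > \epsilon \,\bigr\} \;\le\; \frac{B_{\mathrm{res}}}{\epsilon},
\]
% while the remaining blocks, with norm at most epsilon, perturb the network
% output only slightly; the effective complexity is therefore governed by at
% most B_res / epsilon non-trivial blocks.
```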

Q2: Why is there no curse of dimensionality in your problem setting? Is this because of the ConvResNeXt structure, the Besov function space or other assumptions?

A2: Firstly, the ground-truth Besov function resides on a low-dimensional manifold, which makes it possible to learn the function with a sample size depending only on the intrinsic dimension of the manifold. More importantly, ConvResNeXts can capture the low-dimensional structure of the manifold locally, adapt to the Besov smoothness, and achieve the close-to-optimal convergence rate, which is not achievable by kernel methods such as the NTK.

Q3: The authors claim that ConvResNeXts can 'efficiently learn the function without suffering from the curse of dimensionality'. What does 'efficiency' here mean, sample efficiency or computational efficiency?

A3: ConvResNeXts can efficiently learn the ground-truth function in terms of sample efficiency. The risk of our ConvResNeXt estimator converges to the optimal risk at the rate $\tilde{O}(n^{-\alpha(1-o(1))/(2\alpha+d)})$ with $n$ samples. Equivalently, to achieve an $\epsilon$-estimation error, we need at least $\tilde{\Omega}(\epsilon^{-(2\alpha+d)/(\alpha(1-o(1)))})$ samples.
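As a hypothetical numerical illustration of this rate (the values of α and d below are placeholders chosen by us, not from the paper), ignoring the o(1) term and logarithmic factors:

```latex
% Example only: smoothness alpha = 2, intrinsic dimension d = 4.
\[
  \tilde{O}\!\bigl(n^{-\alpha/(2\alpha + d)}\bigr)
  \;=\; \tilde{O}\!\bigl(n^{-2/8}\bigr)
  \;=\; \tilde{O}\!\bigl(n^{-1/4}\bigr),
\]
% so halving the excess risk requires roughly 2^4 = 16 times more samples, and
% the exponent depends on the intrinsic dimension d rather than the ambient one.
```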

Q4: The authors claim that ConvResNexts learn Besov functions with a better convergence rate close to the minimax rate, which is significantly faster than NTK methods. What is the intuition behind this discovery?

A4: We provide a detailed exposition of this effect in Appendix A of the updated paper. Let us summarize the key points here. ConvResNeXts have remarkable expressive power owing to their structure: they can efficiently approximate functions with heterogeneous smoothness, such as Besov functions, while attaining moderate model complexity when trained with weight decay. Although NTK methods can adapt to low-dimensional manifolds, they require learning a very specific kernel that changes over the spatial region. Moreover, NTK methods can only achieve the suboptimal convergence rate $n^{-\frac{2\alpha-d}{2\alpha}}$ when learning Besov functions (Donoho et al., 1990). The smoothness of the learned function depends on the choice of kernel, so when the target function has heterogeneous smoothness, which is common in Besov spaces, no single kernel can learn the target function well everywhere, and the learned function must be suboptimal.

Reference: Donoho, D. L., Liu, R. C., and MacGibbon, B. Minimax Risk over Hyperrectangles, and Implications. The Annals of Statistics, 1990.

AC Meta-Review

The authors develop a representation and statistical estimation theory for ConvResNeXts, which are generalized versions of convolutional ResNets. The main contribution of the paper is to prove that if the sampled data lies on a low-dimensional smooth manifold, then the ConvResNeXts are more efficient than ResNets in the sense that they automatically adapt to the function smoothness as well as the intrinsic (rather than the ambient) dimension of the data. The techniques used in their proofs (appear to) mostly follow prior published work.

While the contributions were interesting and potentially useful, the paper suffers from key shortcomings. Some reviewers (and I) felt that the paper was very hard to read. In particular, the key novel insights in their results (and proof techniques) were not explained very clearly, especially in the light of what was known before. There was some back-and-forth during the discussion stage between the reviewers and the authors, but questions still remained unanswered. Overall, it appears that the paper may benefit from at least one more round of extensive revision (emphasizing clarity of the contributions, and novelty of the theoretical results in light of the existing literature).

Why not a higher score

Most reviewers (and I) thought that the contributions were not clearly expressed.

Why not a lower score

N/A

Final Decision

Reject