VC dimensions for deep neural networks with bounded-rank weight matrices
Abstract
Reviews and Discussion
The paper establishes upper bounds for the VC dimension of BRFCNNs, analyzes the influence of different ranks on generalization bounds, and demonstrates a BRFCNN with a VC dimension close to the upper bounds, suggesting a nearly tight upper bound and insights into network generalization capacity.
Strengths
- The paper derives upper bounds for the VC dimension of BRFCNNs and compares these upper bounds for various settings.
- The analysis explores how different ranks impact the generalization bounds of BRFCNNs.
- The paper presents a constructed BRFCNN that achieves a VC dimension close to the upper bounds derived. This demonstration indicates that the upper bound is nearly tight, providing valuable insights into the network's capacity for generalization.
Weaknesses
I am not familiar with the topic of VC dimensions for DNNs, and I cannot fully understand the technical contribution of this paper. As a result, I am unable to provide constructive review comments. However, I strongly advise the authors to revise the paper's main body and offer more in-depth discussion of the insights behind the technical contribution.
Questions
Could you please clarify the technical challenge in this topic? It would help me understand the technical contribution.
The paper gives an upper bound on the VC dimension of DNNs under the assumption of bounded rank. This bound is then used to prove generalization bounds, and finally a lower bound is given that matches the upper bound in some limits.
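For context, this is the standard shape of a VC-type generalization bound (a schematic reminder on my part, with constants omitted; not a formula quoted from the paper), where $d_{\mathrm{VC}}$ is the VC dimension, $n$ the sample size, and $\delta$ the failure probability:

```latex
% Schematic VC-type uniform generalization bound (constants omitted).
% With probability at least 1 - delta over an i.i.d. sample of size n,
% simultaneously for every classifier f in a class of VC dimension d_VC:
\[
  R(f) \;\le\; \hat{R}_n(f)
  \;+\; O\!\left(\sqrt{\frac{d_{\mathrm{VC}}\,\log(n/d_{\mathrm{VC}}) + \log(1/\delta)}{n}}\right),
\]
% where R(f) is the test error and \hat{R}_n(f) the training error.
```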
Strengths
By assuming the weight matrices have bounded rank, the dependence of the VC dimension on the network width is reduced, which can be a significant improvement when the rank is much smaller than the width. The tools used could be generalized to other types of constraints, as mentioned in the conclusion.
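A back-of-the-envelope parameter count (my own illustration, not taken from the paper) shows why the rank constraint matters: a rank-$r$ factorization of a $w \times w$ weight matrix carries far fewer degrees of freedom than the full matrix when $r \ll w$.

```latex
% Degrees of freedom of a w x w weight matrix, full vs. rank <= r.
\[
  W \in \mathbb{R}^{w \times w}: \; w^2 \text{ parameters};
  \qquad
  W = U V^{\top},\; U, V \in \mathbb{R}^{w \times r}: \; 2wr \text{ parameters}
  \;\;(\text{intrinsic dimension } r(2w - r)).
\]
```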
Weaknesses
The readability of the paper could be improved; many lemmas are simply stated one after the other without any discussion.
Though the dependence of the bounds on the width is improved, they still fail to explain the empirical observation that the test error generally decreases with width in practice.
Questions
Some recent results (https://arxiv.org/abs/2305.19008) suggest that the (approximate) rank of the weight matrices could vary between layers, with high-rank layers towards the beginning and end of the network and approximately low-rank weights in the middle. How would this affect the VC dimension?
It seems unlikely that the learned weights are exactly low rank; do you think that the bounds could be generalized to approximately low-rank matrices?
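A minimal sketch of how one might check this empirically, assuming access to the trained weight matrices (the matrices and the tolerance below are hypothetical placeholders): compute a numerical rank and a stable rank per layer and see how they vary with depth.

```python
import numpy as np

def numerical_rank(W, tol=1e-2):
    """Count singular values above tol * (largest singular value)."""
    s = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

def stable_rank(W):
    """||W||_F^2 / ||W||_2^2, a smooth surrogate for the rank."""
    s = np.linalg.svd(W, compute_uv=False)
    return float(np.sum(s ** 2) / s[0] ** 2)

# Hypothetical example: layer_weights would be the trained weight matrices
# of a fully connected network, listed from input to output.
layer_weights = [np.random.randn(256, 256) @ np.diag(np.linspace(1, 0.01, 256))
                 for _ in range(4)]
for i, W in enumerate(layer_weights):
    print(f"layer {i}: numerical rank = {numerical_rank(W)}, "
          f"stable rank = {stable_rank(W):.1f}")
```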
This paper focuses on deriving the VC dimension of a fully connected feedforward network with piecewise polynomial activations and bounded-rank weight matrices. The authors' main theoretical contribution is showing that the VC dimension is bounded in terms of the (constant) layer width, the rank of the fully connected layers, the maximum number of pieces of the piecewise activations, the maximum degree of these polynomials, and the network depth. The paper further argues that this upper bound on the VC dimension is almost tight for a ReLU network, by constructing a network whose VC dimension comes close to it. By plugging their result into the classic VC-dimension-based generalization error bound of Vapnik, they arrive at a generalization bound for bounded-rank fully connected networks.
Strengths
The authors pick a seemingly novel problem, the VC dimension of rank-bounded fully connected networks. The main result on the VC dimension implies a generalization bound for these function classes. In the particular case of a ReLU network, the resulting growth rate of the error bound is in many ways interesting. While a direct test of this result may not be very feasible, it is a pity that no empirical evidence is presented to check whether this bound is valid.
Weaknesses
Main critique: My big issue with the paper is that the ideas are not presented well, and the technical/notation issues impede the reader's understanding. While reading the paper I frequently had to go back and forth to find the definition of a variable. In other places, very simple concepts (like how activations and fully connected layers are defined) are explained in unnecessarily complicated and convoluted ways. I've tried to aid the authors in improving these presentation issues in my detailed issues/questions list.
Main technical/issues questions:
- "Due to the rank constraint for the weight matrices, are not necessarily polynomials, instead they could be rational fractions." This sentence is not elaborated on, and on to the best of my understanding, is not correct. a low-rank matrix can be written as , and any polynomial activation applied on the matrix where denotes the previous layer, is a polynomial in the space of the inputs. Please correct me if I'm wrong in making these hand wavy conclusions.
- Remark 3: what happened to the rest of the bound? Please correct me if I'm wrong, but from what I can see, even if we ignore the remaining term, the first part suggests one bound on the generalization error, while Remark 3 claims that a substantially smaller bound holds with high probability.
- Remark 3: Formally speaking, if the failure probability converges to zero, then the corresponding term in the bound diverges to infinity. Taken from Wikipedia, "with high probability" means "an event ... whose probability depends on a certain number n and goes to 1 as n goes to infinity." I am guessing that what the authors want to say is that, for some arbitrarily small tolerance, we can adjust the parameters such that the bound holds with a failure probability that goes to zero as the sample size increases; for a suitable choice, the probability that the bound holds converges to 1 as n increases. Again, please correct me if I'm wrong.
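Regarding the first point above, here is the reasoning written out in symbols (my reconstruction of the hand-wavy argument, not text from the paper): a rank-constrained layer composed with a polynomial activation still produces polynomial, not rational, functions of the layer input.

```latex
% If W = U V^T (rank <= r) and \sigma is a polynomial, then for input x:
\[
  W x = U\,(V^{\top} x), \qquad
  \big(\sigma(W x)\big)_i \;=\; \sigma\!\Big(\textstyle\sum_{j} (U V^{\top})_{ij}\, x_j\Big),
\]
% i.e. a polynomial composed with a linear form -- again a polynomial in x.
```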
Detailed issues/questions:
- p2, end of page: the "let ..." statement seems to contain a typo; shouldn't they belong to the other set instead?
- page 3 mentions weight and bias symbols that, from what I can see, are not defined anywhere in the text. Also, the weights were earlier defined as a matrix, while here they look like a vector. I think it's better to clarify this notation a bit and use simpler, more standard notation.
- page 3: while strictly speaking this is not a problem, using the variable z as the pre-activation variable is confusing at first read. Define it as the pre-activation first?
- page 5, Thm 2: "Let w_i be the number of network parameters from layer 1 to layer i." I find this sentence quite confusing and hard to understand. What do you mean here?
- page 8, Thm 5: there seems to be a conflict with the variable d, which was previously defined as the maximum degree of the polynomial but here is used as the bound on the VC dimension. This is particularly troublesome because the first two equations in Thm 5 invoke the Thm 2 result, which does have a "d", and use it in Thm 4, which also has d but with a different meaning. Perhaps rename one of the variables?
Questions
- I wonder what the main motivation for this theoretical case study is. What surprising conclusion can we draw from this result and apply to some problem? From what I can see, this is somewhat missing from the current manuscript. While this is not a direct criticism, including such an explanation would go a long way toward persuading and engaging future readers.
- If I understand correctly, the VC dimension of fully connected networks has been addressed previously, and the main result here concerns the case where the fully connected layers are low-rank. Without delving into too much technical depth, the fact that the degrees of freedom in a rank-limited fully connected weight matrix are reduced (roughly from quadratic in the width to the width times the rank) seems like a borderline obvious thing. Can you convince me otherwise, that this is not a simple result?
The paper explores the VC dimensions of DNNs with piecewise polynomial activations and bounded-rank weight matrices. It presents an upper bound for these VC dimensions, constructs a ReLU DNN that reaches this bound, and compares the generalization power of various DNN architectures.
Strengths
The paper studies an important problem. The comparison of different DNN architectures in terms of their VC dimensions could be a useful tool in the future.
Weaknesses
The main weakness of the paper is the claim of novelty. The authors assert that this is the first study of the VC dimensions of low-rank DNNs, which is contradicted by the existence of prior work [1]. This oversight undermines the originality of the paper. The authors should have acknowledged this previous work and clearly delineated their contributions relative to it.
Another potential weakness is the lack of empirical VC dimensions. In [1], the authors reported empirical VC dimensions, which provided a practical validation of their theoretical results. The inclusion of similar empirical results would strengthen the current paper.
[1] Empirical Study on the Effective VC Dimension of Low-rank Neural Networks, ICML 2021 OPPO
Questions
- How do the authors differentiate their work from the prior study [1] on the VC dimensions of low-rank DNNs? A clear comparison of the two studies would be beneficial.
- Can the authors provide empirical VC dimensions to validate their theoretical results, as was done in [1]? This would enhance the practical relevance of the paper.
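One way such an empirical estimate could be probed (a minimal sketch of the random-label shattering idea, not the procedure of [1]; the low-rank architecture, sizes, and training loop below are hypothetical placeholders): find the largest sample size at which a bounded-rank network can still fit arbitrary random labels.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Linear layer whose weight matrix is constrained to rank <= r via U @ V."""
    def __init__(self, dim, rank):
        super().__init__()
        self.U = nn.Linear(rank, dim, bias=False)
        self.V = nn.Linear(dim, rank, bias=False)
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.U(self.V(x)) + self.bias

def can_fit_random_labels(n, dim=32, rank=4, depth=3, epochs=2000):
    """Try to fit n random points with random 0/1 labels; return True if fit exactly."""
    X = torch.randn(n, dim)
    y = (torch.rand(n) > 0.5).float()
    layers = []
    for _ in range(depth):
        layers += [LowRankLinear(dim, rank), nn.ReLU()]
    layers += [nn.Linear(dim, 1)]
    model = nn.Sequential(*layers)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    pred = (model(X).squeeze(-1) > 0).float()
    return bool((pred == y).all())

# Rough capacity probe: the largest n (on this grid) that is still fit exactly.
for n in [16, 32, 64, 128, 256]:
    print(n, can_fit_random_labels(n))
```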
Summary: The article studies the VC dimension of networks with piecewise polynomial activation and bounded rank weight matrices.
Strengths: Reviewers find the paper studies an important, seemingly novel problem.
Weaknesses: Main concerns are the presentation and lack of discussion.
Although the article presents interesting results, I agree that the presentation and discussion, particularly of the implications, need improvement. Reviewers have provided diverse suggestions. The authors did not respond to the initial reviews. Hence I am recommending rejection of the article.
Why not a higher score
The authors did not respond to the initial reviews.
Why not a lower score
NA
Reject