PaperHub
Rating: 6.0/10 (Poster, 4 reviewers; individual scores 6, 3, 7, 8; min 3, max 8, std 1.9)
Confidence: 3.0 · Correctness: 2.8 · Contribution: 3.0 · Presentation: 2.5
NeurIPS 2024

The Sample Complexity of Gradient Descent in Stochastic Convex Optimization

OpenReview · PDF
Submitted: 2024-05-11 · Updated: 2024-11-06
TL;DR

We analyze the sample complexity of full-batch vanilla gradient descent on a stochastic convex problem

Abstract

We analyze the sample complexity of full-batch Gradient Descent (GD) in the setup of non-smooth Stochastic Convex Optimization. We show that the generalization error of GD, with common choice of hyper-parameters, can be $\tilde \Theta(d/m+1/\sqrt{m})$, where $d$ is the dimension and $m$ is the sample size. This matches the sample complexity of worst-case empirical risk minimizers. That means that, in contrast with other algorithms, GD has no advantage over naive ERMs. Our bound follows from a new generalization bound that depends on both the dimension as well as the learning rate and number of iterations. Our bound also shows that, for general hyper-parameters, when the dimension is strictly larger than the number of samples, $T=\Omega(1/\epsilon^4)$ iterations are necessary to avoid overfitting. This resolves an open problem by Schlisserman et al. '23 and Amir et al. '21, and improves over previous lower bounds that demonstrated that the sample size must be at least the square root of the dimension.
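A rough sanity check on the $T=\Omega(1/\epsilon^4)$ claim (a sketch, assuming the generalization lower bound scales as $\eta\sqrt{T}$ and the standard non-smooth GD optimization error scales as $1/(\eta T)+\eta$): reaching excess error $\epsilon$ requires both $\eta\sqrt{T}\lesssim\epsilon$ and $\eta T\gtrsim 1/\epsilon$, hence

$$\sqrt{T} \;=\; \frac{\eta T}{\eta\sqrt{T}} \;\gtrsim\; \frac{1/\epsilon}{\epsilon} \;=\; \frac{1}{\epsilon^2}, \qquad \text{i.e.} \qquad T \;\gtrsim\; \frac{1}{\epsilon^4}.$$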
Keywords

Stochastic Convex Optimization, Learning Theory

Reviews and Discussion

Official Review (Rating: 6)

The paper shows a new lower bound on the generalization error of gradient descent for Lipschitz functions. The bound exhibits a linear dependence on the dimension, which closes the gap between lower and upper bounds on the sample complexity of GD in several regimes. The construction of the function relies on a block version of a variant of the Nemirovski function and on a reduction from a sample-independent oracle to a sample-dependent oracle. The authors also propose a set of open problems related to the generalization error of GD.

Strengths

  • The paper shows a clever way to modify and leverage Nemirovski and Feldman's function so as to control the dynamics of the sub-differentials
  • The authors try to give the best intuition for the construction of the function in the presentation of the paper
  • The paper is technically challenging and the proofs have been explained well. I have read some parts of the proofs and didn't see any problem.

Weaknesses

  • I think the main weakness of the paper is the presentation of the results and improvements. I know that these have all been written in the paper, but I found myself lost and rereading many times to see the significance of the results. This could be rewritten more concisely, with all the improvements, and the regimes under which they hold, spelled out explicitly (maybe a table would make sense).
  • There are a few typos: Line 294: sequences. Line 296: receive(s). Line 300: $S_{1:0}$, should $S$ be bolded? Line 320: punctuation and line break.
  • It should be mentioned that Eq 16 will be proven in Appendix D.

Questions

  • I'm really confused by the fact that all bounds and the proofs use $F(0)$ for the gap. Why is this the case? Why isn't it $\min_w F(w)$? If the algorithm starts at $w_0=0$, doesn't it mean the function value increases?

Limitations

Yes

Author Response

Thank you very much for the review and detailed comments.

  • This could be rewritten more concisely, with all the improvements, and the regimes under which they hold, spelled out explicitly (maybe a table would make sense).

A table is an excellent suggestion. We will follow it, thanks!

  • It should be mentioned that Eq 16 will be proven in Appendix D.

Thanks for catching this. It will be mentioned.

Also thanks for catching typos!

  • I'm really confused by the fact that all bounds and the proofs use $F(0)$ for the gap. Why is this the case?

Thanks for pointing this out. As to the bounds, it will be a good idea to state the results in terms of $\min_{w\in W} F(w)$ and not $F(0)$ (clearly, $F(w)-F(0) \ge F(w)-\min_{w\in W} F(w)$).

As for the proofs, you are right. We get the excess error w.r.t. the initialization, which may seem strange. There is no contradiction here, as GD minimizes the empirical loss and the increase is w.r.t. the population loss. But overall it is strange that you start at an initialization that is better than where you end up. This also happens, btw, in other similar constructions (see [3]).

Comment

I thank the authors for the response.

Regarding my question: is there a typo where the authors wrote $F(w) - F(0) \ge F(w) - \min_w F(w)$? Because it clearly should be the other direction. Here, my question explicitly is: given the bound on $F(w) - F(0)$, how do I obtain a bound on $F(w) - \min_w F(w)$? It's not obvious to me, unless I'm missing something.

Comment

Yes, it's a typo, sorry for that. We have $F(w)- \min_{w\in W} F(w) \ge F(w)- F(0)$ and not the other way around. This is also the direction we actually want.

So, any lower bound of the type $F(w)-F(0) = \Omega(f(m,\eta,T,d))$ automatically yields a lower bound of the type $F(w)- \min_{w\in W} F(w)= \Omega(f(m,\eta,T,d))$. In particular, Theorem 1, Corollary 2, and Corollary 3 can all be stated with respect to the minimizer instead of $F(0)$.
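Spelled out (writing $w_{\mathrm{GD}}$ for the GD output and assuming the initialization $0 \in W$, so that $F(0) \ge \min_{w\in W} F(w)$), the transfer is a single inequality:

$$F(w_{\mathrm{GD}}) - \min_{w\in W} F(w) \;\ge\; F(w_{\mathrm{GD}}) - F(0) \;=\; \Omega\big(f(m,\eta,T,d)\big).$$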

Does that answer your question?

Comment

Yes, I somehow was confused and didn't realize that. Thank you for the response! I maintain my current score.

Official Review (Rating: 3)

The paper proves a tight lower bound on the sample complexity of full-batch gradient descent. The authors also present some open questions in this area.

Strengths

I think the paper is not ready.

Weaknesses

  • The structure of the paper is not standard. There isn't any conclusion, and it seems that the paper is written in a very rushed manner.
  • The notation section is not well-written.
  • It seems that the main contribution of the paper is in Theorem 1, but it is not clear where its proof is located. The proof in the appendix is very short, and it appears that some parts of the proof are in the main body of the paper and some in the appendix. It is not organized very well.

Questions

I couldn't understand the paper very well, so I do not have any questions. Please refer to the weaknesses section.

Limitations

I think the paper is not ready for this conference, and I am not sure how important the result is.

Author Response

Thank you very much for taking the time to review the manuscript.

  • The conclusions are clearly presented in the manuscript, and are also reflected in the other reviews. If, after reading the other reviews, you still feel that certain contributions are not highlighted enough, please point to them concretely and we will gladly highlight them.
  • As to notation, please share concrete examples of missing/confusing/non-standard notation. We will gladly fix any such confusion.
  • As to organization, all proofs appear in the appendix. Main theorems and main Lemmas appear in the main text. This is a standard practice.

Unless there are strong reasons for rejection, we ask the reviewer to raise the score so that it will not reflect negatively on the paper.

Comment

Dear Authors, I’m sorry that my score is lower than that of the other reviewers. As you requested, I reviewed the opinions of the other reviewers and noticed that reviewer gMtP also found this paper difficult to read. I even reread the paper to better understand it and reconsider my score, but I still couldn’t fully grasp the content. This remains my main issue with the paper.

I’ve provided some suggestions that might help improve the paper. For example, adding a "Notation Section" and a "Conclusion Section" and reorganizing the paper and proofs to enhance readability would be beneficial. In its current version, your paper ends with a Lemma, which is quite unusual. As you can see, my confidence score is 2 because I couldn’t fully understand the paper, and therefore, I cannot give you a high score.

However, if the other reviewers are confident in their scores and support accepting this paper, I will not oppose their decision.

Additionally, during the remaining discussion period, I will make another effort to fully understand your work, and if possible, I will adjust my review accordingly.

Official Review (Rating: 7)

This paper studies the sample complexity of full-batch gradient descent for stochastic convex optimization. The main result of this work provides a lower bound on the generalization gap of GD of order $\Omega(\min\{\frac{d}{m},1\} \cdot \min\{\eta d^{3/2}, \eta\sqrt{T},1\})$, where $d$ denotes the dimension, $m$ the sample size, and $\eta, T$ the learning rate and number of iterations of GD. Notably, this lower bound, together with upper and lower bounds for GD or general ERM established in prior works, implies meaningful results in different regimes. In the over-parametrized regime where $d=\Omega(m+T^{1/3})$, the main result, together with a lower bound on the empirical risk of GD, yields a lower bound of $\Omega(\min\{\eta\sqrt{T}+\frac{1}{\eta T}, 1\})$. Moreover, when $T=O(m^{3/2})$ and $\eta=\Theta(1/\sqrt{T})$, the main result yields a lower bound of $\Omega(\min\{\frac{d}{m}+\frac{1}{\sqrt{m}},1\})$, matching the known upper bound for general ERM algorithms.
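To unpack the over-parametrized case (a sketch using only the bound as stated above together with the empirical-risk lower bound mentioned in the review): when $d=\Omega(m+T^{1/3})$ we have $\min\{\frac{d}{m},1\}=\Omega(1)$ and $\eta d^{3/2}=\Omega(\eta\sqrt{T})$, so the main bound reduces to

$$\Omega\big(\min\{\eta\sqrt{T},\,1\}\big),$$

and adding a lower bound of order $\frac{1}{\eta T}$ on the empirical optimization error of GD yields the stated $\Omega\big(\min\{\eta\sqrt{T}+\frac{1}{\eta T},\,1\}\big)$.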

Strengths

This paper provides an improved lower bound on the sample complexity of full-batch gradient descent for SCO. In several regimes, the lower bound matches existing upper bounds for GD or general ERM and is therefore tight. Moreover, it implies that the worst-case sample complexity of GD is the same as that of general ERM. These results contribute to a better understanding of the sample complexity of the SCO problem.

Weaknesses

The technical results are solid, but I think the writing could be improved. In particular, the connection between the high-level intuition provided in Sec 4 and the actual constructions in Sec 4.1 is not apparent. For example, it's not clear to me why the oracle $\mathcal{O}^{(t)}$ defined in Eq. 11 satisfies the zero-chain property depicted in Figure 1.

Questions

  • From Thm 1 to Cor 3 when $\eta=\Theta(1/\sqrt{T})=\Omega(m^{-3/4})$ and $d\le\sqrt{m}$: I agree that the first term in Thm 1 is lower bounded by $\min\{\frac{d}{m},1\} \ge \min\{\frac{1}{\sqrt{m}},1\}$, but the second term $\eta d^{3/2} \gtrsim d^{3/2}m^{-3/4}$ could potentially be smaller than 1. If so, should the lower bound actually be $\min\{\frac{1}{\sqrt{m}},1\}\cdot \min\{\frac{d^{3/2}}{m^{3/4}}, 1\}$?

  • Line 219: could the authors elaborate on why the new oracle defined in lines 214-218 only activates $O(d)$ coordinates in $d^3$ iterations? From my intuition, to activate a new coordinate, say $k+1$, the oracle needs to increase all previous coordinates to $k$. Then this would only require $d^2$ iterations to activate $O(d)$ coordinates. I think I am missing something here.

Minor comments:

  • It would be helpful to clarify early in the paper that $w(i)$ denotes the $i$-th coordinate of a vector $w$.
  • 194: $\partial N(0)$ => $\partial N_0(0)$
  • 299: Let => let
  • 300: $\mathbf{S}_{1:0}$ => $S_{1:0}$

Limitations

N/A

Author Response

Thank you very much for the constructive comments and the positive score. Below are answers to concrete questions.

  • For example, it's not clear to me why the oracle $\mathcal{O}^{(t)}$ defined in Eq. 11 satisfies the zero-chain property depicted in Figure 1.

Notice that Figure 1 refers to Eq. 16 and not Eq. 11. The figure depicts the specific choice of dynamics and the subdifferentials as they appear above Eq. 16. The caption of the figure should be rephrased as: “Depiction of the dynamics described above Eq. (16)”, or, to refer to the main text instead, “Depiction of the dynamics described below Eq. (9)”.

Frames 3, 6, 7 depict the case of item 1 (below Eq. 9), frames 1, 2, 4, 5 the case of item 2 (below Eq. 9), and the last frame the last item ($\partial = 0$). Is that clearer?

  • Should the lower bound actually be $\min(\frac{1}{\sqrt{m}},1)\cdot \min(\frac{d^{3/2}}{m^{3/4}},1)$ [as opposed to $1/\sqrt{m}$]?

Notice that the $\Omega(1/\sqrt{m})$ term does not follow from Thm 1 but, as stated in the manuscript, “is a well-known information-theoretic lower bound for learning”. A citation will be added to the manuscript to clarify. For example, this lower bound can be found in the lecture notes https://optmlclass.github.io/notes/optforml_notes.pdf (Theorem 16.7), where it is proven that there are two instances of stochastic convex optimization such that every algorithm “fails” on one of them with excess error $1/\sqrt{T}$ unless more than $T$ examples are observed (in our notation $T=m$).

  • Could the authors elaborate on why this new oracle defined in lines 214-218 only activates $O(d)$ coordinates in $O(d^3)$ iterations?

Roughly, to increase coordinate $k$ you need to increase each previous coordinate by $+1$. Increasing coordinate $t$ requires $k-t$ iterations (increasing $k$ and propagating it back). So overall $\sum_{t=0}^{k}(k-t)= O(k^2)$ iterations are needed in order to increase the $k$'th coordinate by 1, and overall $O(1+2^2+3^2+\dots+d^2)$ iterations to increase all coordinates by one, which amounts to $O(d^3)$.
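Written out, the total count is a sum of squares:

$$\sum_{k=1}^{d} O(k^2) \;=\; O\!\left(\frac{d(d+1)(2d+1)}{6}\right) \;=\; O(d^3).$$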

Regarding the minor comments, thanks for the suggestions and for catching the typos. We will follow the reviewer's advice and corrections.

Comment

Thank you to the authors for their detailed response. After considering your explanation and re-evaluating the manuscript, I now have a clearer understanding of how the oracle functions. Thanks for the clarification, and I maintain my original rating.

Official Review (Rating: 8)

Summary:

The authors study the generalization error and sample complexity of GD in stochastic convex optimization. Their results show that one can achieve a generalization error of $\tilde{\Theta}(d/m + 1/\sqrt{m})$, which is the same as the generalization error of ERM. Indeed, they prove that the linear dependence on the dimension $d$ is unavoidable. Moreover, they show that $1/\epsilon^4$ steps are needed for GD to avoid overfitting, where $\epsilon$ is the optimality gap.

Strengths

Pros:

  • interesting theoretical problem/results
  • excellent citation to related work
  • extremely well-written

Weaknesses

Cons:

  • some discussion about SGD is missing

Questions

Questions/Comments:

This is an excellent paper. I recommend acceptance, as I found the paper well-written and the contributions/motivations clear.

  • Can you explain how this analysis restricts to full-batch GD and what makes it impossible to obtain results for SGD? I recommend adding a bit of discussion to the paper regarding this.

Limitations

N/A

Author Response

Thank you very much for taking the time to review the manuscript, and particularly for the positive review.

  • Can you explain how this analysis restricts to full-batch GD and what makes it impossible to obtain results for SGD? I recommend adding a bit of discussion to the paper regarding this.

Thank you for the question. First, just to clarify, the results cannot be obtained for SGD, as SGD has dimension-independent sample complexity; this is discussed in the paper. But the question remains how SGD circumvents the construction. A discussion about that will be added to the paper along the following lines:

The main issue with the proof is that it relies on a reduction to a sample-dependent oracle (defined in Section 4.1). SGD circumvents the reduction (Lemma 9). In more detail, the lemma assumes the update is given by Eq. 12 (the full-batch update rule). Because of this full-batch update rule, we can "encode" the whole sample in the trajectory. This allows the reduction, because the oracle that depends on the parameter can "decipher" the sample and behave like a sample-dependent oracle. If the sample points arrive one-by-one, as is the case in SGD, then the sample is not known in the first steps, hence we cannot reduce to a sample-dependent first-order oracle.
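For concreteness, the contrast can be written in generic notation (this is the standard form of the two update rules; the paper's Eq. 12 may differ in its details):

$$\text{full-batch GD:}\quad w_{t+1}=w_t-\frac{\eta}{m}\sum_{i=1}^{m} g_{t,i},\;\; g_{t,i}\in\partial f(w_t;z_i), \qquad \text{one-pass SGD:}\quad w_{t+1}=w_t-\eta\, g_t,\;\; g_t\in\partial f(w_t;z_t).$$

In the full-batch rule every iterate depends on the entire sample $z_1,\dots,z_m$, which is what allows the trajectory to "encode" the sample; under the one-pass SGD rule the iterate $w_t$ has only seen $z_1,\dots,z_{t-1}$.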

Comment

Thanks! I appreciate the authors' response and their promise to include the new discussion in the next version of the paper. I still support accepting the paper, so I keep my score positive.

Final Decision

The paper has a clean result on a classical problem, and the majority of the reviewers seem to share the sentiment. One of the reviewers points out that they found the paper hard to read; perhaps the authors can mitigate this issue and strive for better readability. I found the paper easy to follow, but perhaps this comes from prior expertise in the area. My main, and only (if you can call it negative), point about the paper is that it focuses on stochastic convex optimization, so while the problem is fundamental, it has limited scope relative to current focus areas in ML and optimization. Overall I recommend the paper for acceptance.