5.8

/10

Poster5 位审稿人

最低3最高7标准差1.5

3.4

置信度

正确性3.0

贡献度3.0

表达2.8

NeurIPS 2024

Nesterov acceleration despite very noisy gradients

Kanan Gupta,Jonathan W. Siegel,Stephan Wojtowytsch

OpenReview PDF

提交: 2024-05-14更新: 2024-11-06

摘要

关键词

Stochastic optimizationstochastic accelerationsmooth convex optimizationdeep learningaccelerated gradient descent

评审与讨论

审稿意见

评分: 6置信度: 42024-07-10

The paper studies the accelerated gradient method of Nesterov’s (NAG) for smooth and (strongly-)convex problems under the multiplicative noise model when the variance of the stochastic gradients behave as $O(\sigma^2 \| \nabla f(x) \|^2)$ . The authors identify that the NAG in its original momentum formulation could tolerate multiplicative noise for $\sigma \in [0,1)$ . They propose a modified version which uses 2 step size sequences for the update and momentum steps, respectively, which are a constant factor apart from each other, and prove that their method achieves the optimal rates for any positive value of $\sigma$ for general convex and strongly convex problems.

优点

Clarity and Presentation
The manuscript identifies the key contributions in terms of the algorithmic design and explains how/why the particular parameter choices are necessary for the convergence with optimal rates.

Strengths
The authors identify the limits of NAG and propose an easy but nice fix to remedy the limitations under multiplicative noise.

They are inspired by the continuous time aspect of NAG, similar to [Even et al., 2021] and [Liu and Belkin., 2018] but their modified construction of the Lyapunov function accommodates the 2 step-size-formulation they have. Looking at the analysis for NAG and AGNES, one could see that the separation in the second term of the Lyapunov function, i.e., $\| b(n) ( x’_n – x_n ) + a(n) ( x’_n – x^* ) \|^2$ , enables to handle the multiplicative noise and the 2 step size structure at the same time. This is a simple fix, but it proves to be useful.

To my knowledge, this is the first work that successful studies the general convex and strongly-convex functions under multiplicative noise and proves optimal bounds. I find the results and the simplicity of the proposed method to be important.

缺点

Clarity and Presentation
I think it should be made clearer that the proposed method is as simple, modified version of NAG rather than a new framework as the only difference is using 2 step size sequences, which are only a multiplicative factor of $\frac{1}{1 + \sigma^2}$ apart from each other.

Weaknesses The equivalence between [Liu & Belkin, 2018] makes me regards this work as a new analysis for an existing framework. They also have overlapping proof ideas with [Even at al., 2021] (use of continuous-time perspective, similar Lyapunov function constructions). Having that said, the authors seem to have identified correct modifications on the existing methodologies. I strongly suggest to include a dedicated section of comparison between the analysis techniques and explain their main strengths/contributions in accordance with the related work as it is not clear from the manuscript.

The results for the general convex case is not too interesting in my opinion. The main reason is the existing work on adaptive methods, which essentially prove that $\sum_{i=1}^{n} a_i^2 \| \nabla f(x_i) \|^2 < O(LD^2)$ , where $D$ is the diameter of bounded set of iterates. Please refer to [I - IV] for relevant results. I cannot say anything for sure before analyzing in details but looking at their proofs, it should be fairly possible for those methods to tolerate even to unknown values of $\sigma$ when $\mu = 0$ . Note that most probably this would not be extended to strongly-convex problems.

References

[I] Levy, K.Y., Yurtsever, A., & Cevher, V. (2018). Online Adaptive Methods, Universality and Acceleration. ArXiv, abs/1809.02864.

[II] Cutkosky, A.. (2019). Anytime Online-to-Batch, Optimism and Acceleration. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:1446-1454 Available from https://proceedings.mlr.press/v97/cutkosky19a.html.

[III] Kavis, A., Levy, K.Y., Bach, F.R., & Cevher, V. (2019). UniXGrad: A Universal, Adaptive Algorithm with Optimal Guarantees for Constrained Optimization. Neural Information Processing Systems.

[IV] Joulani, P., Raj, A., György, A., & Szepesvari, C. (2020). A simpler approach to accelerated optimization: iterative averaging meets optimism. International Conference on Machine Learning.

问题

Could you please comment on other formulations of acceleration such as the optimal method in [V, Eq. (3.11)] and the method of similar triangles [VI]. Specifically, optimal method in [V] has a 2 step size formulation and it might already give the desired result for multiplicative noise.
In Figure 3, what happens to NAG when $\sigma = 0$ ? Please also include a case where $\sigma < 1$ to compare NAG and AGNES.
Could you elaborate on the contributions of your paper in terms of the theoretical analysis w.r.t. prior work?

References
[V] Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming, 103, 127-152.

[VI] Gasnikov, A.V., Nesterov, Y.E. Universal Method for Stochastic Composite Optimization Problems. Comput. Math. and Math. Phys. 58, 48–64 (2018). https://doi.org/10.1134/S0965542518010050

局限性

I would suggest the authors to move the Appendix B.2 to the main text and discuss the equivalence to the prior work clearly.

作者回复

2024-08-06

I think it should be made clearer that the proposed method is as simple, modified version of NAG rather than a new framework as the only difference is using 2 step size sequences, which are only a multiplicative factor of $\frac{1}{1+\sigma^2}$ apart from each other.

We agree that a key point is that our method is a simple modification of NAG. We tried to highlight this in the abstract and will be happy to emphasize it more in the main text of a revised version of the paper.

The equivalence between [Liu & Belkin, 2018] makes me regards this work as a new analysis for an existing framework. They also have overlapping proof ideas with [Even at al., 2021] (use of continuous-time perspective, similar Lyapunov function constructions). Having that said, the authors seem to have identified correct modifications on the existing methodologies. I strongly suggest to include a dedicated section of comparison between the analysis techniques and explain their main strengths/contributions in accordance with the related work as it is not clear from the manuscript.

Thank you for this comment. We agree that our work should be regarded as a new analysis (and derivation) of the method of [Liu & Belkin, 2018]. In addition, the ideas behind the proof, in particular Lyapunov function arguments inspired by an analysis of the continuous dynamics, are fairly well-known in the optimization literature. We derived the method from our analysis of NAG independently and found the equivalence after the analysis was completed. The main novelty is a simple parametrization which simplifies the proofs of our main results (which are novel for this method).

The results for the general convex case is not too interesting in my opinion. The main reason is the existing work on adaptive methods, which essentially prove that

\sum_{i=1}^n a_i^2 |\nabla f(x_i)|^2 \leq O(LD^2),

where $D$ is the diameter of bounded set of iterates. Please refer to [I - IV] for relevant results. I cannot say anything for sure before analyzing in details but looking at their proofs, it should be fairly possible for those methods to tolerate even to unknown values of $\sigma$ when $\mu = 0$ . Note that most probably this would not be extended to strongly-convex problems.

Thank you for pointing out this literature, which we were previously not familiar with, and which we are planning to add to our literature review to include this work. Based upon our reading, this work is very extensive and covers the situations of online optimization and adaptivity with respect to the (additive) noise level $\sigma$ and the smoothness parameter $L$ (which may even be infinite for some results, assuming that the objective function is Lipschitz-continuous). The second case excludes for instance for strongly convex functions, since we have

f(x)\geq f(x^*) + \frac\mu2\|x-x^*\|^2,

i.e. the function grows quadratically at infinity. If the were Lipschitz-continuous, it could grow at most linearly. Many merely convex functions also grow faster than linearly at infinity.

Unbounded gradients particularly affect the multiplicative noise setting we consider: The multiplicative noise is our 'friend' close to the set of minimizers since it becomes small, but it is our 'enemy' at infinity since it becomes large and may send us in the wrong direction. In particular, we have no a priori bounds on the diameter of the trajectory or on the gradient estimates in $L^\infty$ . The fact that we do not constrain the trajectory also allows us to formulate an result in the setting where the objective function does not have minimizers, in which case the optimizer trajectory has to diverge. It is not immediately apparent how the adaptive algorithms would fare in these situations as they have not been studied with multiplicatively scaling noise.

We agree that this is an important part of the literature and we will include a comparison to it in a revised version of the paper. Still, we believe that we contribute new results even in the convex case here.

Could you please comment on other formulations of acceleration such as the optimal method in [V, Eq. (3.11)] and the method of similar triangles [VI]. Specifically, optimal method in [V] has a 2 step size formulation and it might already give the desired result for multiplicative noise.

Again, we thank the reviewer for pointing out this work, which we were previously unaware of. Based upon our reading, the method in reference [V, Eq. (3.11)] is an optimal accelerated method in the plain convex (i.e. not strongly convex) case, which works even with a constraint (the convex set $Q$ ). Without further analysis, we do not know how this method will perform with multiplicative noise, but we believe that such an analysis will be too extensive to add to this paper and thus leave it to future work. We will certainly mention this work and its contribution to accelerated methods in the convex case in a revised version of the manuscript, however. We were also very intrigued by the method of similar triangles, which gives a nice way of deriving accelerated optimization methods for convex and strongly convex objectives that we were not aware of. However, we do not know how this method will generalize to the case of multiplicative noise without significant further analysis, which we again leave to future work. Of course, we will certainly mention the contributions of this paper in our revision.

评论- Official Comment by Reviewer e67y

2024-08-12

Thank you for your responses.

Convex case and adaptive methods
Let me try to explain what I had in mind with more details. First, I do not suggest such methods are better suited for this problem; I can see how their analysis techniques are suitable for handling multiplicative noise. In the proof of adaptive methods, one cannot guarantee descent each iteration unless the method is equipped with linesearch. Therefore, the main strategy is to prove that cumulative error of not knowing the true Lipschitz constant will be bounded by a constant that is guided by the initial step size, Lipschitz constant and the initial distance/bounded domain radius. At the end of the day, the effort of bounding this cumulative error is equivalent to showing that the adaptive step-size has a strictly positive limit and it is lower bounded by a large enough value. The step size is of the form:

\gamma_t = \frac{\gamma}{\sqrt{ \beta + \sum_{k=1}^{t} \alpha_k^2 \|\| \nabla f(\bar x_k) \|\|^2 }}

where $\gamma$ and $\beta$ are some initialization parameters, $\alpha_k$ is the averaging parameter that grows in the order of $O(k)$ and $\bar x_k$ is the weighted average of the sequence(s) generated by the algorithm with respect to the weights $\alpha_k$ .

Note that this step size is non-increasing and assume for simplicity that $\eta_t$ incorporates the knowledge of $L$ and $\mu$ . Then, one could show that, under unbounded gradients, the step-size has a strictly positive limit, and the algorithm should converge at a rate of $O(1/T^2)$ . This translates to showing that $\beta + \sum_{k=1}^{t} \alpha_k^2 \|\| \nabla f(\bar x_k) \|\|^2 \leq O(1)$ .

Multiplicative noise gives us a way to move from the noise vector to the full gradients. In its presence, the analysis will have an additional term of the form $\sigma^2 \sum_{t=1}^{T} \|\| \nabla f(x_t) \|\|^2$ . The main upside is that this summation is in terms of the full gradient, but missing the scaling parameter which could be added to the expression by upper bounding it. Then, we will have two summation terms of the same form; one has a multiplicative constant that depends on $L$ (the original error term) and the other has the variance term $\sigma^2$ . The proof needs additional steps to verify this logic, but I believe it is not too complicated.

My point is not to downplay the results presented in this paper, but I would like to propose that there are multiple techniques that have the potential to achieve the accelerated rates under multiplicative variance. As I mentioned, this technique is not valid when we are aiming for a linear rate. Indeed, [Levy, 2017] has a result for (non-accelerated) adaptive methods for smooth and strongly convex problems. When the algorithm knows $L$ and $\mu$ , it is possible to achieve (non-accelerated) linear rate, however, there is a multiplicative factor $T$ , which also appears in the linear rate exponent.

References: Levy, K. Y. Online to Offline Conversions, Universality and Adaptive Minibatch Sizes. NeurIPS 2017.

2024-08-13

We are grateful to the reviewer for going above and beyond in providing such a detailed outline, and we agree that this is a very interesting direction to pursue. We will look into this, possibly for a follow-up article, depending on how complex it will be to fill in the gaps.

审稿意见

评分: 7置信度: 32024-07-11

This paper presents a new accelerated gradient method for smooth (possibly strongly) convex functions. It is assumed that the available gradients are noisy, where the noise level is proportional to the norm of the gradient.

The authors show that Nesterov's accelerated gradient method (NAG) will converge in this setting when the constant of proportionality for the noise level is less than 1, but will diverge otherwise. To remedy this, the authors present the Accelerated Gradient descent with Noisy EStimators (AGNES) scheme. AGNES employs an additional parameter $\alpha$ in the momentum step, and show that if $\eta = \alpha$ (where $\eta$ is the step size for NAG), then NAG is equivalent to AGNES.

Theoretical guarantees are presented to show that AGNES converges in the convex case, and at a better rate in the strongly convex case. Due to the noise in the gradients, the theoretical results hold in expectation, and the authors also show almost sure convergence in Corollary 5.

Numerical experiments are also presented to support their findings.

优点

Writing. The paper was well written and presented. Clear motivation for the new algorithm was given, as well as a comparison with other results in the literature to put this work in context. I enjoyed reading it.

Contribution. Problems where one encounters noisy gradients are abundant in the literature, so new algorithms for such tasks are welcome. The algorithm is conceptually simple, with few parameters to tune, and theoretical results guarantee convergence.

Numerical experiments. The algorithm is supported by numerical experiments, which show the practical behavior of AGNES, and support the findings of the paper.

缺点

Personally, I would have preferred if the authors had used different notation. For example, in (1), there are two sequences of points, $\{x_n\}$ and $\{x_n'\}$ . But then, the authors use the notation $g_n = g(x_n',\omega_n)$ , i.e., the gradient estimate does not have a prime, but it is associated with the sequence $x_n'$ involving a prime. I found it a bit jarring to read. To be honest, it probably would be even better to have used different letters to denote the two sequences, rather than a prime.

I know it is hard to write in this style where a lot of important information needs to be put off until the appendix. However, in this case, it might have been an idea to have included the Literature Review (Appendix C) in the main body of the paper, and then taken out some of the length in the numerical experiments section (and put it in the appendix) to balance it out to 9 pages.

Both of these are personal preferences for the authors to take note of (I am not asking them to do anything here).

问题

N/A

局限性

N/A

作者回复

2024-08-05

We are grateful to the reviewer for their feedback and encouraging comments.

I would have preferred if the authors had used different notation. For example, in (1), there are two sequences of points, $x_n$ and $x'_n$ . But then, the authors use the notation $g_n=g(x'_n, \omega_n)$ , i.e., the gradient estimate does not have a prime, but it is associated with the sequence involving a prime.

We thank the reviewer for the good suggestion, and we agree that this would make the notation more consistent. We will be happy to replace $g_n$ with $g'_n$ for NAG and AGNES to make it clearer that those gradients are estimated at $x'_n$ .

it might have been an idea to have included the Literature Review (Appendix C) in the main body of the paper, and then taken out some of the length in the numerical experiments section (and put it in the appendix) to balance it out to 9 pages.

We agree that the literature review being in the main body would help the readers put the current work in context of prior works. We will be happy use the additional content page that we would get, if accepted for publication, to move the literature review to the main body of the paper.

审稿意见

评分: 3置信度: 42024-07-12

This paper proposes an accelerated gradient method which is applicable in stochastic convex optimisation with unbiased but highly noised estimator. The method is analysed in a continuous-time convergence framework. Special attention is paid to the application of the method in machine learning, corresponding numerical experiments were carried out which confirm the advantage of the proposed method over some other ones, including classical Nesterov's accelerated method.

优点

The proposed method is indeed a good alternative to a classical NAG method when some simple problem statements are considered, which is the case in the problems chosen for practical evaluation. The analysis in continuous-time fashion seems unnecessary here, yet convenient to proof the results shortly and clearly. Most of minor technical details are clarified in appendices.

缺点

The language requires proof-reading, at least with an automated tool, and should be made more formal. The same recommendation is regarding the mathematical expressions: text takes several liberties here and there, i.e. uses not introduced notions without purpose, applies same-called functions to differently typed arguments, exploits x* for convex functions which are not guaranteed to reach its inf, etc. which makes an impression of that of informal draft, that I believe can be altered, taking into account that no significant details remain unclarified after reading the appendices. Similarly for numerical experiments: std-shadows are not added to some of the plots in stochastic experiments, axes are not labelled, etc. Besides, text contains many remarks on physical meaning, or on qualitative comparison of the proposed method with classical ones, which are so unclear that contribute nothing to understanding, but shift focus from the factual results. To summarise, text is far from being publishable.

问题

Contribution of the paper is not clear. Since we expect that paper contributes either to theory or to practice, I try to find if it does. In theory, it does not make exhaustive literature review (arxiv:2307.01497 is ignored, for example; the only comparision of theoretical guarantees were given on Figure 1 with SGD) which is expectable when yet another accelerated method is proposed, competes with only SGD and NAG, i.e. the simplest methods, does not consider any structured optimisation settings and does not provide some generalizable analysis with any new techniques, but only uses some known tools to show the noise-robustness in the simplest setting, again without comparing their results with the literature on optimisation with state-dependent noise. In practice, it looks better when comparison is carried out on not-toy examples like image classification, but achieved improvement does not look significant in comparison to Adam, and its advantage is still confirmed by few simple problems, which does not convince a practitioner that method worth utilisation. Thus, this method does not seem helpful to the community, at least through its presentation in this text.

局限性

The limitations are clear from reading.

作者回复

2024-08-05

We are grateful to the reviewer for their helpful feedback.

The language requires proof-reading [...] and should be made more formal [...] To summarise, text is far from being publishable.

We understand that our presentation goes against the reviewer's stylistic preferences, and we are happy to integrate feedback into a revised version. To be sure, we would request more specific feedback about alterations the reviewer is suggesting. Since two other reviewers rated the presentation as '4: excellent', we are hesitant to consider changes without further clarification. We understand that this places additional strain on the reviewer's time, and we are very grateful for their service and the feedback they already provided.

text [...] exploits $x^*$ for convex functions which are not guaranteed to reach its inf

The existence of a minimizer $x^*$ is stated as part of the assumptions in Theorems 1 and 3. In fact, we study convex objective functions without minimizers in Theorem 7 (see also the paragraph above it). We are happy to rephrase the statement to emphasize this in a revised version.

If a minimizer does not exist, we do not expect that the algorithm would yield quantitative guarantees since even the 'heavy ball ODE' (deterministic, continuous time) can be arbitrarily slow due to Lemma 4.2 in [1] (see also lines 870-883 in Appendix F).

std-shadows are not added to some of the plots in stochastic experiments

The std-shadows are included in all experiments involving neural networks (albeit imperceptibly small in some plots compared to the oscillation of the trajectory which obscure them somewhat). They are not included in the plots for convex optimization since they made the plots less clear. Standard deviations were small and comparable for all methods considered. We did not find them to add much to the message: If $\mathbb E[f(x_n)] < \varepsilon$ for instance, then $f(x_n)< n\varepsilon$ with probability at least $1-1/n$ by Chebyshev's inequality, directly yielding bounds on $f(x_n)$ since $\mathbb E [f(x_n)] \to 0$ fairly quickly.

In theory, it does not make exhaustive literature review (arxiv:2307.01497 is ignored, for example

The literature review is in Appendix C. We would be very happy to move it to the main text in a revised version and dedicate the additional text page to it. In a large and complex field, the review is certainly not complete, but we provide more context with the literature there (see also the next point).

We were indeed unaware of arxiv:2307.01497 and we will be happy to add it to the literature review.

the only comparision of theoretical guarantees were given on Figure 1 with SGD)[...], competes with only SGD and NAG, i.e. the simplest methods,

We are emphasizing the advantages of acceleration in Figure 1, which are shared with other methods such as ACDM and CNM. We present those methods alongside AGNES in Figure 3. We would be happy to replace the label AGNES in Figure 1 by 'AGNES, ACDM'. We conjecture that the same holds for CNM, but the full result is not available in the literature.

without comparing their results with the literature on optimisation with state-dependent noise

We would be happy to expand the discussion of the results of Vaswani et. al. and Even et. al. in a revised manuscript and to include the reference suggested above.

achieved improvement does not look significant in comparison to Adam

We consider this work a (non-trivial) first step at designing algorithms which combine momentum and adaptivity in a more principled way. The simplicity of the algorithm, e.g. compared to ACDM, is an important part of this. Even in our fairly simple examples, it can be seen that for more stochastic gradient estimates (smaller batch or larger model), AGNES is more stable in its performance compared to Adam. A similar trend would be expected when looking at very diverse datasets (although we do not test this hypothesis here).

[1] Siegel, Wojtowytsch: A qualitative difference between gradient flows of convex functions in finite- and infinite-dimensional Hilbert spaces, arXiv:2310.17610 [math.OC]

审稿意见

评分: 7置信度: 32024-07-13

This paper introduces and studies AGNES (Accelerated Gradient Descent with Noisy Estimators), a generalization of Nesterov's accelerated gradient (NAG) descent algorithm. First they show that NAG's guarantees break in the high noise regime. Then they prove that AGNES can accelerate gradient descent at any noise scale in smooth convex and strongly convex minimization tasks. The paper also provides empirical evidence arguing AGNES's improved performance over existing methods in various settings such as ImageNet.

优点

The strongest aspect of their paper is theoretical result. The fact that they only require two parameters will make it easier to apply to practical applications and also it leads to lesser space usage then Vaswani et al.

缺点

The optimizer comparison is done on small datasets and even then not done well since the authors either do not sweep over hyperaprameters such as learning rate or sweep with a large factor such as 10. This reduces the certainty of the claim that AGNES outperforms NAG in the high noise regime for deep learning.

问题

As the authors mentioned their algorithm is equivalent to a reparameterization of previous algorithms such as that in Liu and Belkin and also I believe the one in https://arxiv.org/abs/1803.05591. This should be highlighted more since currently the paper is written as presenting a new algorithm rather than presenting a better analysis.
Could the authors either add more sweeps over hyperparameters or add these limitations to the empirical claims (such as in Contribution 5)?

I would be happy to increase my rating if the authors could answer the above queries. [Addressed]

局限性

Yes.

作者回复

2024-08-05

We are grateful to the reviewer for their helpful feedback and will be happy to incorporate it in a revised version.

This should be highlighted more since currently the paper is written as presenting a new algorithm rather than presenting a better analysis.

We agree with the point that the novelty of our work lies in the analysis of this algorithm, which is a reparametrization of Liu and Belkin (2018)'s algorithm. We will emphasize this more clearly in the revision and further place our work in the broader context of similar accelerated algorithms. We derived AGNES as a modification of NAG independently in a parametrization particularly suited to proving acceleration in this setting and only realized the equivalence after. We maintain that the 'AGNES' version of the scheme is particularly suited to geometric interpretation and analysis, which is why we retained it.

This reduces the certainty of the claim that AGNES outperforms NAG in the high noise regime for deep learning... Could the authors either add more sweeps over hyperparameters or add these limitations to the empirical claims (such as in Contribution 5)?

The work is primarily theoretical and we are happy to acknowledge the limitations of our experimental section.

To address the question at least on one realistic example, we ran additional experiments testing a much wider range of hyperparameters for NAG for the task of training ResNet34 on classifying images from CIFAR-10. The plots along with the complete details of the experiment are included in the pdf attached to the global rebuttal. The results indicate that AGNES outperforms NAG with little fine-tuning of the hyperparameters.

Details of the experiment: We trained ResNet34 on CIFAR-10 with a batch size of 50 for 40 epochs using NAG with learning rate in { $8\cdot 10^{-5}, 10^{-4}, 2\cdot 10^{-4}, 5\cdot 10^{-4}, 8\cdot 10^{-4}, 10^{-3}, 2\cdot 10^{-3}, 5\cdot 10^{-3}, 8\cdot 10^{-3}, 10^{-2}, 2\cdot 10^{-2}, 5\cdot 10^{-2}, 8\cdot 10^{-2}, 10^{-1}, 2 \cdot 10^{-1}, 5\cdot 10^{-1}$ } and momentum value in { $0.2, 0.5, 0.8, 0.9, 0.99$ }. These 80 combinations of hyperparameters for NAG were compared against AGNES with the default hyperparameters suggested $\alpha = 10^{-3}$ (learning rate), $\eta = 10^{-2}$ (correction step), and $\rho = 0.99$ (momentum) as well as AGNES with a slightly smaller learning rate $5\cdot 10^{-4}$ (with the other two hyperparameters being the same). AGNES consistently achieved a lower training loss as well as a better test accuracy faster than any combination of NAG hyperparameters tested. The same random seed was used each time to ensure a fair comparison between the optimizers. Overall, AGNES remained more stable and while other versions of NAG occasionally achieved a higher classification accuracy in certain epochs, a moving time average of AGNES outperformed every version of NAG (results not plotted due to space constraints, but will be included in the revised manuscript).

评论- Reponse to Authors.

2024-08-08

Thank you for the additional experiments. I am assuming that the authors will add limitations re "happy to acknowledge the limitations of our experimental section" in the final version. I will increase my score.

审稿意见

评分: 6置信度: 32024-07-15

This paper introduces ``Accelerated Gradient Descent with Noisy Estimators" (AGNES), a variant of Nesterov's accelerated gradient descent (NAG), and proves that AGNES achieves an accelerated convergence rate regardless of the noise level relative to the gradient in both convex and strongly convex cases. Additionally, experimental results validate the effectiveness of AGNES.

优点

It is impressive that a simple modification of NAG results in a robust theoretical convergence in stochastic settings. The theoretical analysis, including continuous analysis, seems consistent, and the experimental studies are detailed, particularly in the context of deep learning

缺点

I wonder if AGNES is truly novel. In lines 34 to 36, the authors state, ``In this work, we demonstrate that it is possible to achieve the same theoretical guarantees as Vaswani et al. [2019] with a simpler scheme, which can be considered as a reparametrized version of Liu and Belkin [2018]’s Momentum-Added Stochastic Solver (MaSS) method''. Does this imply that AGNES is essentially the same as MaSS or other previously suggested algorithms? Please clarify this point.

问题

Are Theorem 1 and 2 new theoretical results or just induced by prior works?

I recommend to cite `Continuous-Time Analysis of AGM via Conservation Laws in Dilated Coordinate Systems. J. J. Suh, G. Roh, and E. K. Ryu' as the reference of continuous analysis of NAG type algorithm.

Could the convergence result in Theorem 3 and 4 extend to the expectation of gradient norm?

The fifth line of the caption in Figure 6 needs spacing.

局限性

Yes, the authors addressed the limitations.

作者回复

2024-08-05

We thank the reviewer for their helpful feedback and comments.

Does this imply that AGNES is essentially the same as MaSS or other previously suggested algorithms? Please clarify this point.

AGNES is equivalent to MaSS after a reparametrization (shown in Appendix B.2). We derived the algorithm from NAG independently in a way which made the proofs geometrically transparent and allowed us to draw a clear parallel between Theorems 1 and 2 for NAG and Theorems 3 and 4 for AGNES. After the analysis was completed, we realized the equivalence, but decided to work with the version which, in our opinion, allows for a simpler geometric interpretation and more transparent proofs. The results we obtain for AGNES had not been obtained for this time stepping scheme in either parametrization. We will update the introduction of the article to more prominently reflect our contributions and avoid any confusion about the novelty of the scheme.

Are Theorems 1 and 2 new theoretical results or just induced by prior works?

To the best of our knowledge, Theorems 1 and 2 are new results that we did not find elsewhere in the literature.

I recommend to cite `Continuous-Time Analysis of AGM via Conservation Laws in Dilated Coordinate Systems. J. J. Suh, G. Roh, and E. K. Ryu' as the reference of continuous analysis of NAG type algorithm.

We are grateful to the reviewer for making us aware of this reference, which we will be happy to include in our discussion of continuous-time dynamics.

Could the convergence result in Theorem 3 and 4 extend to the expectation of gradient norm?

Yes, since L-smooth functions satisfy the inequality $\|\nabla f(x) \|^2 \leq 2L (f(x) - \inf f)$ (see line 616 in Appendix E for context), Theorems 3 and 4 automatically lead to a bound on $\mathbb E[\|\nabla f(x) \|^2]$ in both convex as well as strongly convex cases. A bound on $\mathbb E[\|\nabla f(x) \|]$ can then be obtained using Jensen's inequality, i.e. $\mathbb E[\|\nabla f(x) \|]^2 \leq \mathbb E[\|\nabla f(x) \|^2]$ . We would be happy to add a remark about this in the final version of the article.

评论- Response to authors

2024-08-13

Thank you for your response and I will keep my score.

作者回复

2024-08-05

We would like to thank all reviewers for their valuable feedback!

The pdf attached contains an additional sweep over hyperparameters to compare AGNES and NAG as suggested by Reviewer 2pRK.

Details of the experiment: We trained ResNet34 on CIFAR-10 with a batch size of 50 for 40 epochs using NAG with learning rate in { $8\cdot 10^{-5}, 10^{-4}, 2\cdot 10^{-4}, 5\cdot 10^{-4}, 8\cdot 10^{-4}, 10^{-3}, 2\cdot 10^{-3}, 5\cdot 10^{-3}, 8\cdot 10^{-3}, 10^{-2}, 2\cdot 10^{-2}, 5\cdot 10^{-2}, 8\cdot 10^{-2}, 10^{-1}, 2 \cdot 10^{-1}, 5\cdot 10^{-1}$ } and momentum value in { $0.2, 0.5, 0.8, 0.9, 0.99$ }. These 80 combinations of hyperparameters for NAG were compared against AGNES with the default hyperparameters suggested $\alpha = 10^{-3}$ (learning rate), $\eta = 10^{-2}$ (correction step), and $\rho = 0.99$ (momentum) as well as AGNES with a slightly smaller learning rate $5*10^{-4}$ (with the other two hyperparameters being the same). AGNES consistently achieved a lower training loss as well as a better test accuracy faster than any combination of NAG hyperparameters tested. The same random seed was used each time to ensure a fair comparison between the optimizers. Overall, AGNES remained more stable and while other versions of NAG occasionally achieved a higher classification accuracy in certain epochs, a moving time average of AGNES outperformed every version of NAG (results not plotted due to space constraints, but will be included in the revised manuscript).

最终决定Accept (poster)

2024-09-25

The paper studies Nesterov's accelerated gradient descent algorithm for smooth convex and strongly convex problems with noisy gradient estimates, and the noise level is proportional to the norm of the gradient. Acceleration over gradient descent is proved. The theory is solid and four reviewers are positive (two suggest accept and two weak accept). The main question of the reviewer who suggests rejection is about the text, such as the language, the mathematical expressions, and the experimental details. I think these problems can be fixed in the camery ready version. I decide to accept the paper. The AC had a thorough discussion with the SAC and the SAC agreed on the decision. I suggest the authors to fix the problems of presentation pointed out by reviewer AMLK in the camery ready version.