PaperHub
Overall: 5.7 / 10
Poster · 3 reviewers
Ratings: 5, 6, 6 (min 5, max 6, std. dev. 0.5)
Confidence: 3.3
Correctness: 3.3 · Contribution: 2.3 · Presentation: 3.3
NeurIPS 2024

Directional Smoothness and Gradient Methods: Convergence and Adaptivity

OpenReview · PDF
Submitted: 2024-05-10 · Updated: 2025-01-14
TL;DR

We derive new convergence rates for gradient descent which depend only on local properties of the objective using directional smoothness.

Abstract

We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization, rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective. Minimizing these upper-bounds requires solving implicit equations to obtain a sequence of strongly adapted step-sizes; we show that these equations are straightforward to solve for convex quadratics and lead to new guarantees for two classical step-sizes. For general functions, we prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness. Experiments on logistic regression show our convergence guarantees are tighter than the classical theory based on $L$-smoothness.
Keywords

directional smoothness · gradient descent · exponential search · polyak stepsize · normalized gradient descent

Reviews and Discussion

Review
Rating: 5

The work proposes different directional smoothness functions and uses them to establish sub-optimality bounds that are adaptive to the optimization trajectory. This approach can be applied explicitly for quadratics, or via an exponential search technique for convex objectives, to obtain adaptive stepsizes that enjoy better guarantees than constant-stepsize GD tuned according to the global smoothness. Furthermore, the paper shows that Polyak and normalized GD obtain convergence guarantees that depend on the directional smoothness without modifying the algorithms. The theoretical claims are empirically evaluated on two logistic regression problems.

Strengths

  1. The technique naturally yields results for several existing methods.
  2. Tighter, adaptive guarantees are appealing due to the observed success beyond the theoretical convergence bounds.

Weaknesses

  1. The paper discusses only deterministic optimization, while modern large-scale optimization is largely stochastic. This raises the question, not discussed in the paper, of whether such approaches are even valid for stochastic optimization (e.g. [1]).

  2. The novelty of the approach is unclear for some of the observed results. Consider especially Theorem 4.4 with the Polyak stepsizes, which achieve the best empirical performance in the experiments. It is straightforward to translate the decrease mentioned in [2] (page 2, end of left column), $\lVert x_{t+1} - x^* \rVert^2 - \lVert x_{t} - x^* \rVert^2 \leq - \frac{(f(x_t)-f(x^*))^2}{\lVert \nabla f(x_t) \rVert^2}$, into a bound $f(\bar{x}_t)-f(x^*) \leq \frac{\lVert x_0 - x^*\rVert^2}{\sum_{k=0}^{t} \eta_k}$ for the weighted average $\bar{x}_t$ (a sketch of this translation appears below). Theorem 4.4 can be considered as replacing $\eta_t$ with a possibly worse $M(x_{t+1},x_t)$. In this case, the simple adaptive result is only framed for convenience in the directional smoothness framework.
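
For context, the translation sketched above goes through as follows (our reconstruction using only the quoted decrease inequality and convexity; notation as in the review, not text from [2]). With $\eta_t := \frac{f(x_t)-f(x^*)}{\lVert \nabla f(x_t) \rVert^2}$, the decrease inequality reads

$\lVert x_{t+1} - x^* \rVert^2 \leq \lVert x_{t} - x^* \rVert^2 - \eta_t \left( f(x_t) - f(x^*) \right).$

Summing over $k = 0, \dots, t$ and telescoping gives $\sum_{k=0}^{t} \eta_k (f(x_k) - f(x^*)) \leq \lVert x_0 - x^* \rVert^2$, and Jensen's inequality applied to $\bar{x}_t := \sum_{k=0}^{t} \eta_k x_k \big/ \sum_{k=0}^{t} \eta_k$ yields the stated bound.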

[1] N. Loizou, S. Vaswani, I. H. Laradji, and S. Lacoste-Julien. Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. In International Conference on Artificial Intelligence and Statistics, pages 1306–1314, 2021.

[2] Hazan, E. and Kakade, S., 2019. Revisiting the Polyak step size. arXiv preprint arXiv:1905.00313.

Questions

  1. Can the framework be of use for stochastic optimization?

  2. Can the authors elaborate on the technical novelty of the results? In particular, it is arguably known that the standard analysis technique of GD exploits smoothness only between consecutive points. What is the main difficulty when we replace global smoothness with an adaptive variant?

Overall, the paper presents an appealing framework for adaptive bounds which encompasses a long list of stepsize strategies, while the novelty and deterministic setting are somewhat in question.

Limitations

See weaknesses.

Author Response

Dear Reviewer,

Thank you for your time and review. We were pleased to see that you thought our "tighter, adaptive guarantees are appealing due to the observed success beyond the theoretical convergence bounds" and that "the technique naturally yields results for several existing methods."

We understand that you are currently not convinced by Theorem 4.4, but our paper has many other results (a total of 11 lemmas, propositions, and theorems). Does the reviewer appreciate, or have questions regarding, any other specific part of our paper?

  1. The paper discusses only deterministic optimization. Can the framework be used for stochastic optimization?

We agree that developing stochastic results is desirable and we are confident that our framework is useful for stochastic optimization. For instance, directional smoothness has already been used (e.g. in [2]) to explore empirical properties of stochastic algorithms. However, we must leave exploring this in detail to future work.

Since directional smoothness is a new concept, we need the space in this paper to carefully characterize classes of functions which have a meaningful directional smoothness function, and to show how this is a useful concept for optimization. This is consistent with the literature, where optimization concepts are often studied in the deterministic setting before being extended to stochastic optimization. For example, consider the papers [4, 5] on the Polyak step-size that you linked: [5] considers only the deterministic setting, and it took several years after its publication for [4] to study the stochastic setting.

  2. The novelty of the approach is unclear for some of the observed results.

The novelty of our approach lies in proposing a new directional smoothness assumption, showing that this assumption holds for differentiable functions, and then showing how directional smoothness functions can be used to precisely characterize the convergence of first-order methods.

  3. Theorem 4.4 ... the simple adaptive result is only framed for convenience in the directional smoothness framework.

Theorem 4.4 says that for any directional smoothness function $M$, the sub-optimality bound $\frac{D^2}{\sum_{k=0}^{t-1} M(x_{k+1}, x_k)^{-1}}$ holds, where $D=\|x_0-x^*\|$. The point of this theorem is to show that Polyak's method has a certain universality: whatever directional smoothness function $M$ exists, Polyak's algorithm adapts to it. The proof the reviewer gives does not show this, and while useful for establishing a convergence bound, it doesn't indicate a universality property. Indeed, many algorithms achieve rates that look like $\frac{D^2}{\sum_{k=0}^{t-1} \eta_{k}}$, e.g. GD with any step-size sequence $\eta_{k} \leq \frac{1}{L}$ achieves this rate. What sets our theorem apart is that it shows the Polyak step-sizes possess an adaptivity that GD with a non-adaptive step-size sequence does not.
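
To make the adaptivity under discussion concrete, here is a minimal Python sketch of GD with the Polyak step-size on a toy quadratic; the example and all names are ours, not the paper's, and the step-size assumes $f^*$ is known:

```python
import numpy as np

# Toy problem: f(x) = 0.5 * x^T A x, minimized at x* = 0 with f* = 0.
A = np.diag([100.0, 1.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
f_star = 0.0

x = np.array([1.0, 1.0])
for _ in range(100):
    g = grad(x)
    # Polyak step-size: uses f* only -- no global L, no smoothness function M.
    eta = (f(x) - f_star) / (g @ g)
    x = x - eta * g

print(f(x))  # the step-sizes adapt per-iterate to curvature along the trajectory
```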

However, we will also add a remark showing how to conclude the proof by keeping the Polyak step-size (i.e. without replacing it by directional smoothness). This observation gives us new insights into the Polyak step-size, showing that the Polyak step-size itself is a type of directional smoothness, and can be used to form a smoothness upper-bound such as Eq (1).

  4. It is arguably known that the standard analysis technique of GD exploits smoothness only between consecutive points.

This is exactly what our work shows explicitly. Though intuitively GD is a local method, and should depend on local notions of smoothness, we found that there is no satisfying prior definition and theory that captures this local behaviour. Unless the reviewer has a reference in mind that they could share?

We also note that tight proofs for the convergence of GD for non-strongly convex functions (e.g. [3; Theorem 3.3]) actually do rely on smoothness everywhere rather than just between iterates. This is because they use an inequality sometimes called co-coercivity [3; Lemma 3.5], which requires global smoothness. Moreover, the existence of an extensive literature on local Lipschitz continuity shows that relaxing global smoothness assumptions is not straightforward in general.

  5. What is the main difficulty when we replace global smoothness with an adaptive variant?

Firstly, discovering the correct definitions of local/adaptive variants of smoothness is not straightforward. Secondly, verifying conditions under which these variants yield directional smoothness bounds and proving bounds like Prop. 3.1, which allow for steps to increase the optimality gap, both require developing non-standard proof techniques. That these results seem obvious in hindsight is a testament to our having "got it right", compared to the many other relaxations of smoothness which have been proposed.

The main technical difficulty is that the directional smoothness depends on the step-size. The standard analysis of GD starts with smoothness as $f(x_{k+1}) \leq f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2} \| x_{k+1} - x_k \|^2 = f(x_k) - \eta_{k} (1 - \frac{\eta_k L}{2}) \| \nabla f(x_k) \|^2$, from which we obtain convergence of the gradient norm when $\eta_k < \frac{2}{L}$. When we replace $L$ by a directional smoothness $M$ instead, we get the requirement $\eta_{k} < \frac{2}{M(x_{k+1}, x_k)}$. Both sides of this inequality depend on $\eta_{k}$, since $x_{k+1}$ depends on $\eta_k$. This implicit equation is not straightforward to solve, and it is not obvious that a solution even exists. We prove that strongly adapted step-sizes do exist, derive new direct methods for computing such step-sizes (root-finding, exponential search), and establish novel connections to normalized GD and the Polyak step-size. Again, we believe the fact that these results are clean and easy to follow in hindsight is a strength of our work, not a weakness.
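
For illustration, a minimal bisection sketch of this root-finding idea (our own construction, not the paper's method; we assume the simple gradient-difference ratio below as a stand-in for the point-wise directional smoothness, whereas the paper's Lemma 2.2 gives a precise form in the convex case):

```python
import numpy as np

def M_est(grad, x, y, eps=1e-12):
    # Assumed stand-in for the point-wise directional smoothness between x and y.
    return np.linalg.norm(grad(y) - grad(x)) / max(np.linalg.norm(y - x), eps)

def adapted_step(grad, x, eta_lo=1e-8, eta_hi=1e3, iters=60):
    """Bisection for a step-size with eta * M(x_{k+1}, x_k) <= 1,
    where x_{k+1} = x - eta * grad(x) itself depends on eta."""
    g = grad(x)
    phi = lambda eta: eta * M_est(grad, x, x - eta * g) - 1.0
    if phi(eta_hi) <= 0:          # even the largest candidate step is adapted
        return eta_hi
    for _ in range(iters):
        mid = 0.5 * (eta_lo + eta_hi)
        if phi(mid) <= 0:
            eta_lo = mid          # mid is adapted; try larger steps
        else:
            eta_hi = mid
    return eta_lo                 # largest step found satisfying the condition
```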

Comment

I thank the authors for their response.

After reading the other reviews and rebuttal responses, I tend toward the acceptance of the paper and I raise my score accordingly.

In any case, extending the framework to the stochastic case, even under stronger-than-usual noise assumptions, would give the framework more credibility and make the case for directional smoothness much stronger.

Comment

Thank you for reading our response and for updating your score! We greatly appreciate your help in improving our paper.

We will make sure to carefully investigate the stochastic case as part of future work.

Review
Rating: 6

The paper proposes a new type of non-uniform smoothness, which the authors label directional smoothness. Directional smoothness replaces $L$ in the typical $L$-smoothness inequality by a function $M(x,y)$, which describes the smoothness along the line between $x$ and $y$. The authors prove multiple basic properties of this notion as well as sub-optimality bounds for GD with general stepsizes in the deterministic convex setting. Additionally, they provide sub-optimality bounds for three specific stepsizes in the same setting.
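
For reference, the descent inequality underlying this notion, with $M(x,y)$ in place of $L$, reads (our reconstruction from the discussion in this thread; the paper's exact statement may differ):

$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{M(x,y)}{2} \|y - x\|_2^2.$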

Strengths

  • The proposed smoothness notion, directional smoothness, generalises many previous notions of smoothness while still allowing for a meaningful analysis of algorithms.
  • Convergence analysis with directional smoothness can potentially give some intuition about which algorithms automatically adapt to (potentially beneficial) local smoothness properties.
  • The authors provide a wide range of properties for directional smoothness and different ways to compute upper bounds on the directional smoothness.
  • The sub-optimality gap for Polyak's step-size, showing it automatically adapts to the local smoothness in a meaningful way, could potentially explain the empirical success of the step-size schedule in practice.

Weaknesses

  • Many of the results until Section 4.2 are direct consequences of the (directional) descent inequality. In particular, it was to be expected that the global smoothness inequality can be replaced with a localized version, and that results along the lines of Propositions 3.1 – 3.3 should hold. That being said, these results can be seen as preliminary results for the propositions / theorems in the later portion of the paper, which provide new insights.
  • There might be some questions regarding the soundness of Proposition 3.2, see Question 1. However, this result is not used in the subsequent results and hence does not have any further implications.
  • Given this submission introduces a new way to analyse algorithms, seeing its application in different settings and for different algorithms would be highly beneficial to showcase the benefit of the proposed approach. All results in the work are, however, limited to the deterministic convex setting, and only gradient descent with different stepsizes is considered. Showcasing the approach for the stochastic setting or for different algorithms (such as Nesterov AGD) could give a more holistic picture of the approach.

Edit after Author Rebuttal

The authors did fix the proof of Proposition 3.2.

Questions


  1. In the proof of Proposition 3.2, why does the last inequality (above line 438) hold? More specifically, in the case $\eta_k > \mu_k^{-1}$, the recursive argument cannot be applied as the sign is flipped. How do the authors deal with this case?
  2. Theorem 4.3 is very unclear to me. What is one “iteration” of GD? In order to reach a point $\hat{x}_T$ with $f(\hat{x}_T) - f(x^*) \leq \varepsilon$, do you require $\widetilde{O}(\varepsilon^{-1})$ or $\widetilde{O}(\varepsilon^{-2})$ gradient evaluations? Maybe these unclear points could be addressed by including the full algorithm, producing the final output $\hat{x}_T$, in Algorithm 1.
  3. The authors first derive suboptimality bounds for general stepsizes, before proving results for specific stepsizes. However, the proofs for Theorems 4.4 and 4.5 are not based on the general results; rather, they are standalone proofs. Why did the authors not plug these stepsizes into the general results? This would be a great opportunity to showcase the application of Propositions 3.1 – 3.3.
  4. Given directional smoothness generalises other non-uniform smoothness notions, did the authors examine whether their sub-optimality bounds reconstruct / improve previous results for notions of smoothness that can be modelled by directional smoothness? For example, can results from $(L_0, L_1)$-smoothness [1] be reconstructed? The corresponding $M(x,y)$ is specified in [2, Lemma A.3] / [3, Lemma 8].
  5. Is there hope for any lower bound results that show strongly adapted stepsizes are indeed the best (non-accelerated) stepsize one can choose in your generalized setting? Given such a result, the Polyak-stepsize results would get even stronger.
  6. Are there any difficulties extending the notion of directional smoothness to the typical nonconvex setting and results, i.e. $\| \nabla F(x) - \nabla F(y) \| \leq M(x,y) \|x-y\|$ and the corresponding GD guarantees?

[1]: Li et al., Convex and Non-convex Optimization Under Generalized Smoothness, NeurIPS 2023

[2]: Zhang et al., Improved Analysis of Clipping Algorithms for Non-convex Optimization, NeurIPS 2020

[3]: Hübler et al., Parameter-Agnostic Optimization under Relaxed Smoothness, AISTATS 2024

Remarks

  • There are multiple potential inconsistencies throughout the work. None of them have an impact on the soundness of the work, but it would be great if the authors could address them.

a) Most results use $k$, some use $T$, while others use $t$ as the last iteration index.

b) The averaged iterate is sometimes denoted using hats and at other times using bars.

c) There are inconsistencies in whether differentiability assumptions are explicitly stated or not. In sections 2 and 4 they are consistently mentioned; section 3 and its proofs in the appendix do not mention them even though they are required.

d) Usage of $f(x^*)$ vs. $f^*$.

e) Appendix C denotes the Euclidean norm by $\|\cdot\|$, while all other sections use $\|\cdot\|_2$.

f) All results in the appendix besides the three theorems are numbered relative to the appendix; the theorems are numbered relative to the main text.

  • Lemma A.4 might better fit in Appendix B.
  • Proposition 3.2 could potentially be improved by using $2 \eta_k [f(x^*) - f(x_{k+1})] \leq -\eta_k \mu_k \Delta_{k+1}$ instead of removing the former term in the last inequality after line 435. This divides each term in the product by $(1+\eta_k\mu_k)$.
  • In its current version, Equation (41) does not provide additional information. The authors could either remove it or modify Equation (40) to include this step.
  • As mentioned in Line 562, the convergence results in Theorem 4.5 require some smoothness assumption such as $L$-smoothness. This assumption is missing in the statement of Theorem 4.5.

Below are some potential typos.

  • Proposition 3.1: $\delta_k$ -> $\delta_{k+1}$
  • Proposition 3.2: $\Delta_k$ -> $\Delta_{k+1}$
  • Theorem 3.4, Case 1: Missing parenthesis
  • Equation (23): $\bar x$ -> $\bar x_t$
  • Last inequality in equation after Line 232: $M(x_{k+1},x_{k})^{-1}$ -> $M(x_{k+1},x_{k})$
  • Equation after Line 393: $f(x)$ -> $f(y) - f(x)$
  • Second inequality in equation after Line 435: $\| \nabla f(x) \|$ -> $\| \nabla f(x) \|^2$
  • Equation after Line 468: $\Rightarrow$ -> $\Leftarrow$ (or $\Leftrightarrow$)
  • Equations after line 475: $M$ -> $D$
  • Algorithm 1, Line 1: $L_0$ -> $\eta_0$
  • Algorithm 1, Line 3: Wrong parentheses
  • Line 527: follow -> following

Limitations

The introduction could more clearly mention that the paper (besides Lemma 2.4) exclusively focuses on the deterministic convex setting.

Author Response
  1. Many of the results until Section 4.2 are direct consequences of the (directional) descent inequality.

The reviewer has listed this as one of the weaknesses of our paper, but we would argue that it is a strength. It is this clear link between the definition of directional smoothness, the descent property, and the convergence rates that makes directional smoothness such an interesting definition. We were very surprised to see that such a natural definition had not been explored before.

  2. Stochastic setting or different algorithms (such as Nesterov AGD) could give a more holistic picture of the approach.

We agree that covering the stochastic setting would strengthen the paper. But since directional smoothness is a new concept, we decided to carefully analyse the deterministic setting first, leaving the stochastic setting for future work.

As for analysing more algorithms, again we agree that more would always strengthen the paper. Yet we already analyse four different settings of GD: constant step size, exponential line search, normalized gradient, and the Polyak step-size. These four settings already justify the utility of our new definition and leave no additional room within a 9-page paper.

  3. In the proof of Proposition 3.2, why does the last inequality (above line 438) hold?

It looks like this is a minor bug. However, it is immediately corrected by upper-bounding as $(1 - \mu_k \eta_k) \Delta_k \leq |1 - \mu_k \eta_k| \Delta_k$, since $\Delta_k \geq 0$. This doesn't affect the meaning of the proof, since we need $\eta_k \leq 1 / \mu_k$ for any hope of progress. Many thanks for pointing it out.

  4. Theorem 4.3 is very unclear to me.

One iteration in this setting means one single gradient access. So to get $f(\hat{x}_T) - f(x^*) \leq \varepsilon$, we require $\widetilde{O}(\varepsilon^{-1})$ gradient accesses. The output of the algorithm $\hat{x}$ would come from running gradient descent one final time after line 5 in Algorithm 1; we will clarify this and write the algorithm in full in the next version of the paper. Thank you for drawing it to our attention.
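
For intuition only, here is a hypothetical sketch of a guess-and-restart exponential search in this spirit; the paper's Algorithm 1 is not reproduced in this thread, so the structure below (halving a candidate step-size whenever the adapted-step check fails, with a gradient-ratio stand-in for $M$) is entirely our assumption:

```python
import numpy as np

def exp_search_gd(grad, x0, T, eta0=1.0, eps=1e-12):
    """Run GD for T steps with a candidate step-size eta; if the adapted-step
    condition eta * M(x_{k+1}, x_k) <= 1 ever fails, halve eta and restart."""
    eta = eta0
    while True:
        x, ok = x0.copy(), True
        for _ in range(T):
            g = grad(x)
            x_next = x - eta * g
            # Gradient-ratio stand-in for the directional smoothness M.
            M = np.linalg.norm(grad(x_next) - g) / max(np.linalg.norm(x_next - x), eps)
            if eta * M > 1.0:
                ok = False
                break
            x = x_next
        if ok:
            return x
        eta *= 0.5  # geometric (exponential) search over candidate step-sizes
```

Each failed candidate costs at most $T$ gradient accesses, and the number of candidates is logarithmic in $\eta_0$, which is consistent with the $\widetilde{O}(\varepsilon^{-1})$ count of gradient accesses described above.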

  5. Why did the authors not plug these step-sizes into the general results?

This is an excellent question. We would have preferred to proceed in this manner, but the Polyak step-size and the normalized GD step-size are not necessarily adapted to the directional smoothness. In other words, they are not descent methods. Instead, Polyak and normalized GD make progress by decreasing the distances $\Delta_k = \|x_{k} - x^*\|_2^2$. As a result, we require dedicated proofs which leverage the specific structure of each algorithm.

  6. Can we reconstruct/improve results for other notions of smoothness using directional smoothness?

Thank you for raising this question relating to $(L_0,L_1)$-smoothness and pointing out [2, Lemma A.3] and [3, Lemma 8]. Looking to Lemma A.3, though at face value it does fit our definition of directional smoothness with $M(x_{k+1},x_k) := (AL_0 + BL_1 \|\nabla f(x_k)\|)/2$, we see an issue that prevents us from applying our theory. The issue is related to the constants $A$, $L_1$ and $B$, which actually depend on all iterations of the algorithm. This is because the result of Lemma A.3 holds under the assumption that $\|x_k - x_{k+1}\| \leq c/L_1$ for all iterations $k$, and for fixed constants $c$ and $L_1$. In other words, we have that $\max_k \|x_k - x_{k+1}\| \leq c/L_1$, and $c/L_1$ depends on the entire trajectory of the underlying algorithm. Furthermore, the constants $A$ and $B$ depend on this constant $c$, since $A = 1 + e^c - \frac{e^c-1}{c}$ and $B = \frac{e^c-1}{c}$. Consequently the quantity $(AL_0 + BL_1 \|\nabla f(x_k)\|)/2$ depends on all iterations, and thus doesn't fit our definition of directional smoothness. We will continue to consider possible connections, and write a comment about this in our updated draft.

  7. Is there hope for any lower bound results that show strongly adapted stepsizes are indeed the best?

This is an interesting and subtle problem, since showing meaningful lower-bounds requires a minimum of variation in the curvature of $f$. For example, if $f(x) = 0.5\|x - x^*\|_2^2$, then the Polyak step-size is just $\eta_k = 1/2$ and there is no difference between Polyak and vanilla GD. This indicates that we may need to exclude "simple" problems like isotropic quadratics from the function class in order to prove that strongly adapted step-sizes lead to strictly improved rates. Since this is highly non-trivial, we leave it to future work.

  8. Are there any difficulties extending the notion of directional smoothness to the typical nonconvex setting?

Please see the general response.

  9. Typographical issues and inconsistencies:

Thanks for pointing these out. We will correct them for the camera-ready deadline.

Comment

I would like to thank the authors for taking the time to respond to all reviewers. Some of my questions were answered completely, for some others I have a few follow-up questions to ensure a correct understanding on my side.

Direct consequence of Descent Inequality

I see the authors' perspective and agree that formalising a long-standing insight can be valuable if it leads to new, previously unknown insights. At the time of my review, I believed that the Polyak stepsize result could be one such insight. However, the comment from reviewer Z7fB raises some questions. Specifically, while I agree that your result has a different flavour by demonstrating adaptivity to an unknown smoothness function, the final result only holds for the weighted average $\bar{x}_t$. The weights in this average require knowledge of the (unknown) smoothness function.

Does the result of Theorem 4.4 hinge on this weighted average? Asked differently, can the authors extend the result to show that Polyak stepsizes lead to the same convergence without requiring knowledge of $M$ to output a final iterate?

Stochastic setting or different algorithms (such as Nesterov AGD) could give a more wholistic picture of the approach

I appreciate that the authors provide sub-optimality bounds for four different algorithms using their introduced concept. However, all four algorithms adhere to the classical GD paradigm $x_{t+1} \gets x_t - \eta_t \nabla f(x_t)$. On a high level, all convergence guarantees demonstrate that this new concept can be applied to such algorithms.

Since the work introduces a new concept, it might arguably be more important to convince readers of its significance and applicability. By limiting all results to this type of algorithm and the deterministic convex setting, it raises the question of whether this is the only setting that can be addressed using this concept. This might in turn explain the lack of previous publications on this straightforward generalisation. The authors' response regarding applicability to the non-convex setting underscores this concern.

Could the authors address this concern?

Proof of Proposition 3.2

I would like to thank the authors for their correction of the proof, with which I agree. I will increase my Soundness score accordingly and remove the corresponding weakness. Please ensure the fix is included in future versions of the work.

$(L_0, L_1)$-Smoothness

I fear that Lemma A.3, while being the first to derive such a result, does not state the most general version of the result. The result has been simplified for presentation purposes. For a more general version, see [1, Lemma 8]. Specifically, the function $M$ is given by

$M(x,y) = B_0(L_1\|x-y\|)\,L_0 + B_1(L_1\|x-y\|)\,L_1\|\nabla f(x)\|,$

where

$B_0(c) = 1 + 2\frac{e^c - 1}{c} - 4\frac{e^c - 1 - c}{c^2}, \qquad B_1(c) = 2\frac{e^c - 1 - c}{c^2}.$

It would be great if the authors could revisit the question using the appropriate Definition / Lemma.
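
For concreteness, a small numerical helper implementing this $M$ (our own sketch, not from either paper; `grad`, `L0`, `L1` are assumed inputs, and we clamp $c$ away from the removable singularity at $c = 0$, where $B_0, B_1 \to 1$):

```python
import numpy as np

def B0(c):
    return 1.0 + 2.0 * (np.exp(c) - 1.0) / c - 4.0 * (np.exp(c) - 1.0 - c) / c**2

def B1(c):
    return 2.0 * (np.exp(c) - 1.0 - c) / c**2

def M_L0L1(grad, x, y, L0, L1):
    # Directional smoothness induced by (L0, L1)-smoothness, per [1, Lemma 8].
    c = max(L1 * np.linalg.norm(x - y), 1e-6)  # clamp to avoid cancellation near c = 0
    return B0(c) * L0 + B1(c) * L1 * np.linalg.norm(grad(x))
```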


[1] Hübler et al., Parameter-Agnostic Optimization under Relaxed Smoothness, AISTATS 2024

Comment

Additional References

[1] Hübler et al., Parameter-Agnostic Optimization under Relaxed Smoothness, AISTATS 2024

[2] Zhang et al., Improved Analysis of Clipping Algorithms for Non-convex Optimization, NeurIPS 2020

[3] Yuki Takezawa, Han Bao, Ryoma Sato, Kenta Niwa, Makoto Yamada, Polyak Meets Parameter-free Clipped Gradient Descent, arXiv, May 2024

[4] Peter Richtárik, Simone Maria Giancola, Dymitr Lubczyk, Robin Yadav, Local Curvature Descent: Squeezing More Curvature out of Standard and Polyak Gradient Descent, arXiv, May 2024

Comment

I would like to express my gratitude to the authors for addressing my questions.

While I can believe that Nesterov's result allows extending the analysis to accelerated regimes, it seems to be more of an existence result rather than a showcase of the applicability of directional smoothness to different algorithms. I hence still believe that this is a concern, and it prevents me from giving a score higher than a weak accept.

I am pleased to hear that the authors were able to reconstruct and even strengthen results that are a special case of their notion of smoothness. Regarding the first point, could the authors briefly outline how they show that $\sum_{k=0}^{T-1} M(x_{k+1},x_k) \in \widetilde{O}(T)$ in the given setting to reconstruct the cited result? It does not seem trivial to me.

I will temporarily increase my score to a weak accept, finalising the score after a response on the above technical question.

Comment

Thank you again for engaging, and also thank you for raising your score. With regard to your remaining technical question:

[Bounded average directional smoothness] Your question is: "when will $\bar{M}_T := \sum_{k=0}^{T-1} M(x_{k+1},x_k)/T = O(1)$?" The answer depends on what we assume. If we consider the same assumptions as Theorem 4 in [3], then the objective function is globally $L$-smooth and $\sum_{k=0}^{T-1} M(x_{k+1},x_k)/T \leq L = O(1)$, in which case our Theorem 4.4 generalizes Theorem 4 in [3], as it proves $O(1/T)$ convergence for every type of directional smoothness. Alternatively, if we drop the global $L$-smoothness assumption, then we need to assume that $\bar{M}_T$ is bounded. This is weaker than assuming global $L$-smoothness; thus we would have both generalized the $O(1/T)$ rate of Theorem 4 in [3] to all directional smoothness functions, and also extended the class of functions to which it applies (to potentially non-globally-$L$-smooth functions).

Please let us know if this is clear.

Comment

I appreciate the authors' efforts in answering the technical question. My excitement stemmed from their claim that the results in their submission improved upon [1, Theorem 4] by removing the L-smoothness assumption.

However, the authors' technical response did not meet my initial expectations. Instead of completely removing the $L$-smoothness assumption, the authors only manage to weaken it to a uniform bound on the smoothness between consecutive points. As reviewer Z7fB pointed out, it is well-known in the community that smoothness along the trajectory is sufficient for the convergence of gradient-descent-type algorithms in the deterministic setting; see e.g. the lecture notes [2, Lemma 3.7]. I do however want to mention that reconstructing [1, Theorem 4] is still an exciting result, as [1] is a concurrent work.

While it is a difficult decision, I will maintain my current weak acceptance, with a tendency toward a borderline acceptance.


[1] Yuki Takezawa, Han Bao, Ryoma Sato, Kenta Niwa, Makoto Yamada. Polyak Meets Parameter-free Clipped Gradient Descent. arXiv. May 2024.

[2] Bernd Gärtner, Niao He, Martin Jaggi. Optimization for Data Science Lecture Notes, FS 23. June 2023.

Comment

Thank you for following-up.

Smoothness along the trajectory is sufficient for the convergence of Gradient Descent type algorithms

Thank you for pointing to the specific reference of Lemma 3.7 in [2]. Lemma 3.7 mentions that the typical descent criterion "already holds if $f$ is smooth with parameter $L$ over the line segment connecting $x_t$ and $x_{t+1}$." But their proof technique for the convergence of gradient descent (Theorem 3.8) does not go through for local/path-wise (the precise meaning depends on the specific definition) or directional smoothness. More broadly, we are unaware of any work that establishes proofs for GD with path-wise directional smoothness. In the optimization community, there is an understanding that GD is a local algorithm and should depend on local smoothness. But that is not the same as determining the proper definition of local/directional smoothness and writing down the proofs of convergence.

It seems to be more of an existence result rather than a showcase of the applicability of Directional Smoothness to different algorithms

We outline below the proofs for two accelerated results and the specific guarantees they obtain. We believe these two new theorems illustrate the power of directional smoothness for refining the analyses of existing algorithms and go beyond an existence result.

In Equation 2.2.7 (General Scheme of Optimal Method), Nesterov [N18] shows that the only property required of the primal update for acceleration is that the descent lemma holds (Eq. (c) in 2.2.7). If we have an adapted step-size $\eta_k \leq 1/M(x_{k+1}, x_k)$, then the directional descent lemma implies

$f(x_{k+1}) \leq f(x_k) - \frac{\eta_k}{2} \|\nabla f(x_k)\|_2^2.$

By replacing Eq. (c) with this inequality and changing Eq. (a) to be

$\alpha_k^2/\eta_k = (1 - \alpha_k)\gamma_k + \alpha_k \mu,$

we obtain the same guarantee as Theorem 2.2.1 [N18], but with a modified $\lambda_k$. In particular, $\lambda_k$ now depends on the directional smoothness (rather than on the global smoothness $L$) through the $\alpha_k$ sequence.

If $\mu > 0$ and $\gamma_0 = \mu$, then $\gamma_k = \mu$ for every $k$, $\alpha_k = \sqrt{\mu \eta_k}$, and thus $\lambda_k = \prod_{i=0}^k (1 - \sqrt{\mu \eta_i})$. As a result, we obtain the following guarantee:

Theorem (strongly-convex acceleration): If $f$ is strongly-convex with $\mu > 0$ and $\eta_k \leq 1/M(x_{k+1}, x_k)$ for every $k$, then GD with Nesterov acceleration converges at the following rate:

$f(x_k) - f(x^*) \leq \left[ \prod_{i=0}^k (1 - \sqrt{\mu \eta_i}) \right] \left( f(x_0) - f(x^*) + \frac{\mu}{2}\|x_0 - x^*\|_2^2 \right).$

This is at least as fast as the standard strongly-convex accelerated convergence rate.

On the other hand, if $\gamma_0 \leq 3 \max_i M_i$, then slightly modifying Nesterov's Lemma 2.2.4 allows us to get $\lambda_k \leq \frac{4 \max_i M_i}{\gamma_0 (k+1)^2}$, which lets us prove the following rate:

Theorem (convex acceleration): If $f$ is convex, $\gamma_0 \leq 3 \max_i M_i$, and $\eta_k \leq 1/M(x_{k+1}, x_k)$ for every $k$, then GD with Nesterov acceleration converges at the following rate:

$f(x_k) - f(x^*) \leq \frac{4 \max_i M_i}{\gamma_0 (k+1)^2}\left( f(x_0) - f(x^*) + \frac{\gamma_0}{2}\|x_0 - x^*\|_2^2 \right).$

Once again, this rate is at least as fast as the standard convex accelerated convergence rate.

Comment

Dear Authors

I acknowledge your response and will take it into consideration for the discussion phase.

Comment

Thank you for engaging with our response! Please let us know if the following answers address your concerns.

Does the result hinge on this weighted average?

Thank you for this relevant question. As we note in Line 228 and after Eq. 25, the weighted average of iterates using the (unknown) smoothness function is always lower-bounded (in $f$) by the best iterate, so we can replace $\bar x$ with $\operatorname{argmin}_{k} f(x_k)$ without changing the guarantee. Thus, the best iterate is adaptive to any directional smoothness function without any knowledge of $M$. If we have access to function value evaluations, calculating the argmin is no more difficult than calculating the weighted average.

Applicability to other settings.

We understand your concerns. We can extend our results to Nesterov acceleration to obtain optimal rates which only depend on the directional smoothness. Nesterov [N18] shows that any gradient method making progress like,

$f(x_{k+1}) \leq f(x_k) - \frac{\eta_k}{2} \|\nabla f(x_k)\|_2^2$

can be generically accelerated using an estimating-sequences argument. This condition is satisfied by gradient descent with step-sizes adapted to a directional smoothness function $M$. Thus, GD with adapted step-sizes can be accelerated to obtain optimal linear/sub-linear rates with a dependence on the maximum directional smoothness observed over the optimization trajectory. Such a result can only improve on the classical accelerated rate.

We will add a formal proof of this result to the paper.

We'd like to reiterate that it has not been established in prior work that adapted stepsizes exist, or that they can be found easily. While our provided counterexample means that general non-convex optimization might not be amenable to an easy analysis, this does not rule out that directional smoothness can be useful under extra assumptions (e.g. that the Hessian is locally Lipschitz, or with directional smoothness functions that can potentially depend on more than two points).

[N18] Nesterov, Yurii. Lectures on convex optimization. Vol. 137. Berlin: Springer, 2018.

Proof of Proposition 3.2

Thanks again for catching this issue.

$(L_0, L_1)$-smoothness and [1, Lemma 8]

Thank you for pointing us to this more modern result and reference. We now agree that $(L_0, L_1)$-smoothness is indeed a valid directional smoothness function according to our definition. This is actually a very interesting observation, which has several consequences, including expanding the applications of directional smoothness and explaining recent work on $(L_0, L_1)$-smoothness and the Polyak step-size.

  1. [Polyak step size and $(L_0, L_1)$-smoothness] During the NeurIPS deadline (May 2024), two works appeared online showing that the Polyak step-size can be analysed under $(L_0, L_1)$-smoothness [3], and a related local curvature condition [4]. In particular, Theorem 4 in [3] establishes that if the objective $f(x)$ is convex, smooth, and $(L_0, L_1)$-smooth, then GD with a Polyak step size converges at a $O(1/T)$ rate with favourable constants. But now that it is clear that $(L_0, L_1)$-smoothness is a directional smoothness, we can immediately improve upon Theorem 4 in [3] in two ways: (a) we now know that assuming $L$-smoothness (in addition to $(L_0, L_1)$-smoothness) is not needed, and (b) this $O(1/T)$ result holds not just for $(L_0, L_1)$-smoothness, but for every directional smoothness function, as we have shown in Theorem 4.4. We also believe this will have consequences for the related definition of local curvature and the results in [4]. In other words, directional smoothness is a generalization of $(L_0,L_1)$-smoothness that also guarantees convergence at the same rate.

  2. [Non-convex path forward] Both [3] and [4] consider the non-convex setting and analyse variants of normalized gradient descent with momentum. Now that we know that $(L_0, L_1)$-smoothness is a particular type of directional smoothness, we can consider generalizing their results to any type of directional smoothness function, thus giving us a clear way forward for the non-convex setting. To do this, we believe we will be able to leverage our Lemma D.4. Yet it is not immediately obvious how to do this, so we cannot promise any substantial result in this paper.

We sincerely thank the reviewer for bringing this up. We will include an extended remark on this connection and its consequences for [3] and [4]. We will also add an acknowledgement to Reviewer vUdK for pointing this out to us, and for an insightful review.

Review
Rating: 6

This paper develops refined sub-optimality bounds for gradient descent in the convex setting. The authors consider directional smoothness, a local and path-dependent smoothness condition, instead of assuming globally bounded smoothness constants as in classical analyses. They discuss several interesting examples of directional smoothness functions and derive sub-optimality bounds with them. For convex quadratic problems, they show that the stepsizes that minimize the upper bounds can be easily computed, which leads to new guarantees for two classical stepsizes. For general convex functions, they show that ideal stepsizes exist and can be computed using Newton's method or by exponential search. They also show that the Polyak stepsize and normalized GD can achieve fast, path-dependent rates without knowing the directional smoothness.

Strengths

  1. The sub-optimality bounds in terms of directional smoothness, and the stepsizes and convergence results that follow from the upper bounds, look novel. The perspective is quite interesting.
  2. The proofs look correct and the results look reasonable. The authors have thoroughly studied relevant aspects and details of their theory.
  3. The authors did a good job connecting their results with existing methods and theories, which is quite insightful.
  4. The writing and presentation are also very good. I enjoyed reading the paper.

Weaknesses

  1. The final convergence rates are path-dependent. Although they are more refined and should be better than classical rates, it is not clear how large the gap is. I wonder if there are any interesting applications for which their rates can be more explicit and show clear improvement over classical ones.

Questions

  1. Have the authors considered non-convex settings? In fact, since the authors mentioned neural networks in the first paragraph of the introduction, I wonder if this directional smoothness can be used to explain certain phenomena in deep neural network training, like edge of stability or benefits of adaptivity.

Some potential typos: (a) Equation (3): missing "+"; (b) Line 194: should it be the inverse of the Hessian?; (c) Theorem 4.3, Case 1: missing ")".

Limitations

The authors have adequately addressed limitations.

Author Response

1. I wonder if there are any interesting applications for which their rates can be more explicit and show clear improvement over classical ones.

Yes, and we provide such an example in Section 4.1 for quadratics. In this example, instead of relying on the largest eigenvalue of the constant Hessian matrix, our convergence rate relies on the Rayleigh coefficient with respect to the local gradient. This Rayleigh coefficient is smaller than the largest eigenvalue unless the local gradient happens to be exactly the eigenvector associated with the largest eigenvalue.
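
A quick numerical illustration of this point (our own toy example, not from the paper): for $f(x) = \frac{1}{2} x^\top A x$, the curvature along the gradient direction is the Rayleigh quotient $\frac{\nabla f(x)^\top A \nabla f(x)}{\|\nabla f(x)\|_2^2}$, which can sit far below $\lambda_{\max}(A)$:

```python
import numpy as np

A = np.diag([100.0, 1.0])          # Hessian with lambda_max = L = 100
x = np.array([0.01, 1.0])
g = A @ x                          # gradient of f(x) = 0.5 * x^T A x

rayleigh = (g @ A @ g) / (g @ g)   # curvature along the gradient direction
print(rayleigh)                    # ~50.5, roughly half of the worst-case L = 100
```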

2. Have the authors considered non-convex settings?

Please see the general response.

3. Connections to edge of stability and adaptivity.

There are close connections between directional smoothness, edge of stability, and the performance of adaptive gradient methods. Ahn et al. [1] propose an alternative definition of directional smoothness (which is closely related to our path-wise smoothness) and use this quantity to study the behavior of SGD in stable and unstable regimes. Similarly, Pan and Li [2] consider a directional sharpness measure. They provide experiments showing that directional sharpness is lower for adaptive methods like Adam than for gradient descent. We will add these references to the paper.

Comment

Thank you for the response! I have also read the other reviews and would like to keep my score.

Author Response

1. Extensions to the non-convex setting.

Deriving meaningful results for non-convex functions is challenging. For example, we can immediately use the directional descent lemma to obtain,

$\begin{aligned} \eta_k \left(1 - \frac{\eta_k M_k}{2}\right) \| \nabla f(x_k) \|^2_2 &\leq f(x_k) - f(x_{k+1}) \\ \implies \frac{1}{K} \sum_{k=0}^{K-1} \eta_k \left(1 - \frac{\eta_k M_k}{2}\right) \|\nabla f(x_k)\|_2^2 &\leq \frac{1}{K}\left( f(x_0) - f(x_K) \right) \end{aligned}$

This inequality shows that the main challenge is computing (strongly) adapted step-sizes such that $\eta_k \leq 1 / M_k$. While the majority of our paper is concerned exactly with this problem, our approaches rely on convexity; the only result we can obtain for the non-convex setting is to use path-wise smoothness and apply Prop. 4.2 along with a root-finding method. Such an approach is not practical because we cannot easily calculate the path-wise smoothness.

In contrast, in the convex setting we can easily calculate the point-wise directional smoothness (as given by Lemma 2.2). Unfortunately, the analog of Lemma 2.2 is false in the non-convex setting. Consider the function $f(x) = \frac{x^2}{2} + 2e^{-(x-\frac{1}{2})^2/2}$. This function is non-convex due to the Gaussian bump centered at $x=\frac{1}{2}$ with amplitude 2. For $x = 1.14$ and $y = -1.05$, we have $f(x) \approx 2.2794 > 1.8352 \approx f(y) + \langle \nabla f(y), x-y \rangle + 2 \|\nabla f(x) - \nabla f(y)\| \|x-y\|$. If the reviewers request, we can include this counterexample in the paper to highlight the added difficulties in the non-convex setting.
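
The counterexample is easy to check numerically; a short verification sketch (our own, with the closed-form derivative written out):

```python
import numpy as np

f = lambda x: x**2 / 2 + 2 * np.exp(-(x - 0.5)**2 / 2)
df = lambda x: x - 2 * (x - 0.5) * np.exp(-(x - 0.5)**2 / 2)  # f'(x)

x, y = 1.14, -1.05
lhs = f(x)                                                    # ~2.2794
rhs = f(y) + df(y) * (x - y) + 2 * abs(df(x) - df(y)) * abs(x - y)  # ~1.835
print(lhs > rhs)  # True: the convex-case bound of Lemma 2.2 fails here
```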

We intend to develop theory for the non-convex setting, but, because non-convex functions require new adapted step-size schemes, we believe that this is beyond the scope of our paper.

References

[1] Ahn, Kwangjun, Jingzhao Zhang, and Suvrit Sra. "Understanding the unstable convergence of gradient descent." International Conference on Machine Learning. PMLR, 2022.

[2] Pan, Yan, and Yuanzhi Li. "Toward Understanding Why Adam Converges Faster Than SGD for Transformers." OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop).

[3] Bubeck, Sébastien. "Convex optimization: Algorithms and complexity." Foundations and Trends® in Machine Learning 8.3-4 (2015): 231-357.

[4] N. Loizou, S. Vaswani, I. H. Laradji, and S. Lacoste-Julien. Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. In International Conference on Artificial Intelligence and Statistics, pages 1306–1314, 2021.

[5] Hazan, E. and Kakade, S., 2019. Revisiting the Polyak step size. arXiv preprint arXiv:1905.00313.

Final Decision

By leveraging directional smoothness—a local, path-dependent condition—rather than global smoothness as in classical analyses, this paper presents new sub-optimality bounds for gradient descent (GD). While computing a step-size adapted to the directional smoothness involves solving a challenging implicit equation, the paper shows that for convex quadratic functions, this equation admits a closed-form solution. For general convex functions, the paper demonstrates that the Polyak step-size and normalized GD achieve fast, path-dependent rates without requiring knowledge of directional smoothness. Experimental results are provided to validate the theoretical convergence guarantees.

Overall, the reviewers find the contributions of the paper to be interesting and valuable. However, several suggestions/questions/concerns have been raised during the review process, which the authors are strongly encouraged to incorporate into their final revision.