5.5

/10

Poster4 位审稿人

最低3最高7标准差1.7

4.3

置信度

正确性3.0

贡献度2.5

表达2.8

NeurIPS 2024

Parameter-free Clipped Gradient Descent Meets Polyak

Yuki Takezawa,Han Bao,Ryoma Sato,Kenta Niwa,Makoto Yamada

OpenReview PDF

提交: 2024-05-12更新: 2024-11-06

TL;DR

We proposed parameter-free methods whose convergence rate is asymptotically independent of $L$ under $(L_0, L_1)$-smoothness

摘要

Gradient descent and its variants are de facto standard algorithms for training machine learning models. As gradient descent is sensitive to its hyperparameters, we need to tune the hyperparameters carefully using a grid search. However, the method is time-consuming, particularly when multiple hyperparameters exist. Therefore, recent studies have analyzed parameter-free methods that adjust the hyperparameters on the fly. However, the existing work is limited to investigations of parameter-free methods for the stepsize, and parameter-free methods for other hyperparameters have not been explored. For instance, although the gradient clipping threshold is a crucial hyperparameter in addition to the stepsize for preventing gradient explosion issues, none of the existing studies have investigated parameter-free methods for clipped gradient descent. Therefore, in this study, we investigate the parameter-free methods for clipped gradient descent. Specifically, we propose Inexact Polyak Stepsize, which converges to the optimal solution without any hyperparameters tuning, and its convergence rate is asymptotically independent of $L$ under $L$-smooth and $(L_0, L_1)$-smooth assumptions of the loss function, similar to that of clipped gradient descent with well-tuned hyperparameters. We numerically validated our convergence results using a synthetic function and demonstrated the effectiveness of our proposed methods using LSTM, Nano-GPT, and T5.

关键词

Polyak stepsizeclipped gradient descentgeneralized smoothness

评审与讨论

审稿意见

评分: 7置信度: 52024-06-18

I think this is a solid paper, interesting and relevant to the community. As somebody who has worked and published on Polyak step sizes in the past, I find the new results interesting regarding the convergence of the Polyak step under the $(L_0,L_1)$ -smooth condition. That helps explain some of the surprisingly fast practical convergence of Polyak step sizes. The new inexact Polyak step size is also interesting. It has clear limitations - it's 1/sqrt(T) rate - but given it's convergence rate holds without knowledge of f*, it is still very interesting. The holy grail here is log dependence on f* misspecification, so there is still some ways to go.

The experiments are inline with my expectations, to DoG method is know to be unstable on many problems and that's seen here. I always like to see more and larger experiments but I think these are sufficient, they cover some main problem classes and use multiple seeds.

优点

Clear and well explained theoretical results.
Puts results into context well.
Clear legible and relevant experiments.

缺点

Some awkward or ungrammatical wording in places.
Inexact Polyak Step-size convergence rate is somewhat slow.
Dependence on estimate of key estimated parameter (f*) is not as good as D-Adaptation/T-DoG.
Stochastic case

问题

N/A

局限性

See above

作者回复

2024-08-07

We thank the reviewer for your quite positive evaluation of our paper.

Some awkward or ungrammatical wording in places.

We will revise them in the camera-ready version by using an English proofreading service.

Inexact Polyak Step-size convergence rate is somewhat slow.

As the reviewer pointed out, Inexact Polyak Stepsize slows down the convergence rate to $\mathcal{O}(1 / \sqrt{T})$ compared with the clipped gradient descent with appropriate hyperparameters (Theorem 2). However, the study for parameter-free clipped gradient descent is a new research topic, and the lower bound for parameter-free clipped gradient descent remains unclear. We believe our discovery and proposed method will benefit future research for parameter-free clipped gradient descent. Especially, revealing that the convergence rates of clipped gradient descent and Polyak stepsize are the same allows us to reduce the number of hyperparameters to be considered from two (i.e., stepsize and clipping threshold) to one (i.e., $f^\star$ ), which will be quite useful for future studies.

Dependence on estimate of key estimated parameter (f*) is not as good as D-Adaptation/T-DoG.

D-Adaptation and DoG try to estimate $\| x^{(0)} - x^\star \|$ , while Polyak-type parameter-free methods, including our proposed method, try to estimate $f^\star$ by utilizing the lower bound $l^\star$ . The study of parameter-free methods is still a nascent topic in the optimization community, and it is yet unclear which estimation is easier. Therefore, we believe it would be useful for future research to study methods other than DoG and D-Adaptation.

Stochastic case

We agree that the stochastic setting is important, but even the convergence analysis of clipped gradient descent has not yet been established in a stochastic and convex setting [1]. Thus, we first need to establish a novel proof strategy for analyzing clipped stochastic gradient descent to create parameter-free clipped stochastic gradient descent, but it is out of scope in our paper. We believe that it is one of the promising directions to analyze Inexact Polyak Stepsize in the stochastic setting, while we left this problem for future work.

More technically, the analysis of clipped gradient descent first derived the upper bound of $\sqrt{f(x_t) - f^\star}$ and then derived the upper bound of $f(x_t) - f^\star$ by squaring both sides (see page 13 in [1]). If we extend this proof strategy to the stochastic setting, we get the upper bound of $\mathbb{E} \sqrt{f(x_t) - f^\star}$ . Therefore, we get the upper bound of $(\mathbb{E} \sqrt{f(x_t) - f^\star})^2$ , which is not a desired upper bound because we would like to obtain the upper bound of $\mathbb{E} f(x_t) - f^\star$ . Thus, we need to establish a different proof strategy to analyze the clipped stochastic gradient descent.

Reference

[1] Koloskova et. al., Revisiting gradient clipping: Stochastic bias and tight convergence guarantees. In ICML 2023

审稿意见

评分: 5置信度: 42024-06-19

The work extends the convergence results for the Polyak stepsize to ( $L_0$ , $L_1$ )-smoothness (and convex), showing a rate of $\mathcal O(\tfrac{L_0}{T} + \tfrac{LL_1^2}{T^2})$ . To remove the dependency on knowing the optimal function value $f^\star$ , they introduce a horizon dependent stepsize factor $1/\sqrt{T}$ . Not surprisingly, this comes at the cost of "deaccelerating" the rate, specifically to $\mathcal O(\tfrac{L_0+\sigma^2}{\sqrt{T}} + \tfrac{LL_1^2}{T}+\tfrac{LL_1^2\sigma^4}{L^2_0 T})$ where $\sigma$ is the error for the guess of $f^\star$ .

优点

It is interesting that Polyak stepsize can recover the rates of clipped GD under ( $L_0$ , $L_1$ )-smoothness, thus removing the need for tuning the clipping parameter and stepsize in this setting as long as $f^*$ is known. It is not too surprising that Polyak stepsize converges for ( $L_0$ , $L_1$ )-smoothness and convex (Polyak stepsize is Fejer monotone under star-convexity alone after all), but the fact that the rate matches those of clipped GD seems interesting.

缺点

The horizon dependent stepsize $1/\sqrt{T}$ (to relax knowledge of $f^\star$ to knowledge of a lower bound $l^\star \leq f^\star$ ), leads to a slow $\mathcal O(1/\sqrt{T})$ rate, which is problematic. Since the work are also assuming knowledge of a lower bound on $f^\star$ , why not use the successive halving strategy originally used for Polyak stepsize in Hazan and Kakade 2019 (Alg. 3)? I imagine it should be able to show a similar result to their Thm. 2, which only suffers a logarithmic factor in the complexity (instead of deaccelerating the scheme). Instead of comparing with parameter-free stochastic Polyak methods in l. 172-178, why not compare with the more appropriate deterministic strategy of Hazan and Kakade 2019 (Alg. 3)?
The connection with gradient clipping seems largely misleading. The paper makes it seem as if the results are concerning gradient clipping (e.g. the paper title, the title of section 4, l. 237). Instead the results directly concern Polyak stepsize, why not simply present it as such?
Stochasticity is not treated or discussed even though it seems like a major motivation from the experiments.

The main contribution seems to be showing rates for Polyak stepsize under ( $L_0$ , $L_1$ )-smoothness (and convexity), which does not seem sufficient in itself. The direction is interesting, but in its current state the paper has too many either unnecessary of misleading parts (as mentioned above).

问题

See weaknesses.

局限性

A couple of important points are not discussed:

The theoretical results are only for deterministic even though stochastic experiments are considered. One should expect only convergence to a neighborhood. If this is not formalized in a theorem, I suggest at least making this theory/practice gap explicit (especially in contrast with the baseline like AdaSPS that have been developed for the stochastic case).
The $1/\sqrt{T}$ factor in the stepsize deteriorates the rate should be discussed

作者回复

2024-08-07

We thank the reviewer for your constructive criticisms of our paper.

The horizon dependent stepsize $1/\sqrt{T}$ (to relax knowledge of $f^\star$ to knowledge of a lower bound $l^\star \leq f^\star$ ), leads to a slow $O(1/\sqrt{T})$ rate, which is problematic. Since the work are also assuming knowledge of a lower bound on $f^\star$ , why not use the successive halving strategy originally used for Polyak stepsize in Hazan and Kakade 2019 (Alg. 3)? [...]

Thank you for the suggestion. We will add the following discussion in the revised manuscript.

Our paper focused on the deterministic setting for theory, while Inexact Polyak Stepsize also works in practice for neural networks, as demonstrated in our experiments. In contrast to Inexact Polyak Stepsize, Adaptive Polyak [2] requires computing the loss value, which is not immediately extendable to the stochastic setting. Thus, Adaptive Polyak is not applicable in the stochastic setting due to its methodological issue. Though our theory only holds in the deterministic setting, our ultimate goal is to develop methods that also work for training neural networks. Thus, we utilized the idea of DecSPS and AdaSPS instead of Adaptive Polyak, proposing Inexact Polyak Stepsize in our paper.

The connection with gradient clipping seems largely misleading. The paper makes it seem as if the results are concerning gradient clipping (e.g. the paper title, the title of section 4, l. 237). Instead the results directly concern Polyak stepsize, why not simply present it as such?

We wonder if the reviewer misunderstood our motivation to study Polyak stepsize. Let us clarify that our paper attempts to propose a parameter-free clipped gradient descent method and leverages the Polyak stepsize as its basis. Here, we briefly summarize the motivation of our paper. By slightly changing the notation, clipped gradient descent can be reformulated as follows:

\mathbf{x}_{t+1} = \mathbf{x}_t - \tilde{\eta}_t \frac{\nabla f (\mathbf{x}_t)}{\| \nabla f (\mathbf{x}_t) \|},

where $\tilde{\eta}_t = \eta \min\\{ \| \nabla f (\mathbf{x}_t)\|, c \\}$ . This reformulation allows us to consider parameter-free methods for $\tilde{\eta}_t$ instead of $\eta$ and $c$ in clipped gradient descent. The gradient norm $\|\nabla f(\mathbf{x}_t)\|$ in the denominator is reminiscent of Polyak stepsize, which motivated us to analyze Polyak stepsize under $(L_0, L_1)$ -smoothness. Then, we uncovered that Polyak stepsize achieves the same convergence rate as clipped gradient descent. While clipped gradient descent uses two hyperparameters $\eta$ and $c$ , Polyak stepsize requires only one hyperparameter $f^\star$ . We explored parameter-free methods for Polyak stepsize (i.e., $f^\star$ ) to study parameter-free clipped gradient descent, leading to our proposal of Inexact Polyak Stepsize. We will revise our manuscript to make this motivation clear to readers.

We are also wondering if the paper title may have misled the reviewers. We will change the word order in the title and the paper title to "Parameter-free Clipped Gradient Descent Meets Polyak" to better reflect our motivation and avoid misunderstandings.

Stochasticity is not treated or discussed even though it seems like a major motivation from the experiments.

We will add the following discussion on stochasticity to the revised manuscript.

We agree that the stochastic setting is important, but we would like to emphasize that parameter-free methods for clipped gradient descent are already challenging even in the deterministic setting. Furthermore, clipped gradient descent does not converge to the optimal solution in the stochastic setting (see Theorems 3.1 and 3.2 in [1]), which makes it more difficult to develop parameter-free clipped gradient descent. Our paper focused on the deterministic setting, but we believe that our findings will be helpful in future studies of parameter-free clipped gradients in the stochastic setting. See also the reply to the next question.

[...] One should expect only convergence to a neighborhood. If this is not formalized in a theorem, I suggest at least making this theory/practice gap explicit [...]

We would like to note that even the convergence analysis of clipped gradient descent has not been established in a stochastic and convex setting [1]. Thus, we first need to establish a novel proof strategy for analyzing clipped stochastic gradient descent to analyze Inexact Polyak Stepsize in the stochastic setting, but it is out of scope in our paper. We believe that it is one of the promising directions to analyze Inexact Polyak Stepsize in the stochastic setting, while we left this problem as a future work.

More technically, the analysis of clipped gradient descent first derived the upper bound of $\sqrt{f(x_t) - f^\star}$ and then derived the upper bound of $f(x_t) - f^\star$ by squaring both sides (see page 13 in [1]). If we extend this proof strategy to the stochastic setting, we get the upper bound of $\mathbb{E} \sqrt{f(x_t) - f^\star}$ and obtain the upper bound of $(\mathbb{E} \sqrt{f(x_t) - f^\star})^2$ . However, the upper bound of $(\mathbb{E} \sqrt{f(x_t) - f^\star})^2$ is not a desired one because we would like to obtain the upper bound of $\mathbb{E} f(x_t) - f^\star$ . Therefore, we need to establish a different proof strategy to analyze the clipped stochastic gradient descent.

The $1/\sqrt{T}$ factor in the stepsize deteriorates the rate should be discussed

Thank you for the suggestion. We promise to discuss in the revised version the convergence rate of Inexact Polyak Stepsize being deteriorated to $O(\frac{1}{\sqrt{T}})$ by the $\frac{1}{T}$ factor in the stepsize.

Reference

[1] Koloskova et. al., Revisiting gradient clipping: Stochastic bias and tight convergence guarantees. In ICML 2023

[2] Hazan et. al., Revisiting the Polyak step size. In arXiv 2019

2024-08-09

I thank the authors for their rebuttal, but my main concern remains:

In contrast to Inexact Polyak Stepsize, Adaptive Polyak [2] requires computing the loss value, which is not immediately extendable to the stochastic setting

Inexact Polyak Stepsize also requires computing the loss value, so I don't understand why Inexact Polyak Stepsize does not also suffer from the same issue as Adaptive Polyak in the stochastic case.

To me the unknown $f^\star$ case is incomplete and it would have been better to either: i) consider the deterministic case and derive rates with only a log factor following Hazan and Kakade 2019 ii) consider the stochastic case (to exclude Hazan and Kakade 2019).

Relationship to gradient clipping

My understanding is still that the results are related to the Polyak stepsize and that it does not exploit the connection (correct me if I'm wrong). The contribution does not seem to be introducing a new (adaptive) clipping scheme, but rather analyzing a well known scheme (the Polyak stepsize) under new conditions.

I still believe that the main contribution is to show matching rates to those of clipped GD under ( $L_0$ , $L_1$ )-smoothness and convex for Polyak stepsize (asymptotic guarantees for Polyak stepsize are given under star-convexity alone in previous works). The question is whether this is a sufficient contribution in itself.

Since I'm no expert in the field, I am willing to update my score if people in the community finds it valuable (e.g. JWAj), so I will wait for their reaction.

2024-08-12

We thank the reviewer for your reply.

[...] To me the unknown $f^\star$ case is incomplete and it would have been better to either: i) consider the deterministic case and derive rates with only a log factor following Hazan and Kakade 2019 ii) consider the stochastic case (to exclude Hazan and Kakade 2019).

Again, we thank the reviewers for your suggestion. We will add the careful discussion on Adaptive Polyak in the revised manuscript.

My understanding is still that the results are related to the Polyak stepsize and that it does not exploit the connection (correct me if I'm wrong).

$(L_0, L_1)$ -smoothness assumption was introduced to explain the practical success of gradient clipping from the theoretical perspective [1,2], and it is a unique property of clipped gradient descent that the convergence rate is $O(\frac{L_0}{T})$ under this assumption. Thus, uncovering that Polyak stepsize achieves the same convergence rate as clipped gradient descent not only shows the fact that it achieves the same rate, but also discovers the connection between Polyak stepsize and gradient clipping that Polyak stepsize combines the fruitful property of gradient clipping.

I still believe that the main contribution is to show matching rates to those of clipped GD under $(L_0, L_1)$ -smoothness and convex for Polyak stepsize [...] The question is whether this is a sufficient contribution in itself. Since I'm no expert in the field, I am willing to update my score if people in the community finds it valuable (e.g. JWAj), so I will wait for their reaction.

We appreciate the reviewer for clarifying your concerns and position. The asymptotic independence on $L$ was a unique property of clipped gradient descent. We believe that uncovering that Polyak stepsize also has the same property as clipped gradient descent is interesting and will be helpful for future research on parameter-free clipped gradient descent because this discovery allows us to study parameter-free clipped gradient descent via Polyak stepsize. We would be grateful if the reviewer could discuss the importance of this finding with other reviewers during the reviewer-AC discussion period.

Reference

[1] Zhang et. al., Improved analysis of clipping algorithms for non-convex optimization. In NeurIPS 2020

[2] Zhang et. al., Why gradient clipping accelerates training: A theoretical justification for adaptivity. In ICLR 2020

审稿意见

评分: 3置信度: 42024-06-28

The paper proposes a version of the Polyak step size when the optimal value is not known. It analyses the proposed method in the convex, $(L_0,L_1)$ -smooth setting, and draws a connection to gradient clipping.

优点

The analysis of Polyak step sizes under $(L_0,L_1)$ -smoothness seems to be novel, and the connection to clipped gradient descent is interesting.

缺点

Comparing Thm. 2 and 5, the convergence rate drops from $1/T$ to $1/\sqrt{T}$ . Comparing convergence results of Alg. 1 to those of DecSPS, AdaSPS is somewhat unfair, as those are stochastic methods, and Alg. 1 is not! Thus, Table 1 is a misleading comparison.
Having the $1/\sqrt{T}$ in the step size in Alg.1 has several disadvantages: one needs to determine the number of iterations before training, and for long runs the step size will become very small. A numerical analysis how the choice of $T$ affects the performance of the method is missing.
Algorithm 1 is a deterministic method (that is, it uses the full gradient). The comparison in section 6.2 suggests that you compare to stochastic methods (SGD,SPS,..). Which batch sizes were used for the methods? Did you use Algorithm 1 as is, or a stochastic version of it? This part of the experimental setup needs to be clarified - further, comparing deterministic to stochastic methods is somewhat problematic, as the per-iteration cost is different, and it obfuscates if any observed (dis)adavtanges are only due to the noise when running a stochastic method.
Finally, the neural network results are not really convincing, as the proposed Inexact Polyak stepsize is only competitive for one out of three settings (and the experiments are still quite small-scale compared to standard LLM benchmarks).

问题

Many plots (e.g. Figure 2 and 3) are very hard to read, due to inadequate color palettes, too thin lines, missing log scales etc.

To make the experiments reproducible, please specify for each hyperparameter and problem the actual value that was used, and not only the set of values among which it was tuned (see Tables 3-5).

局限性

See above.

作者回复

2024-08-07

We thank the reviewer for your comments and criticism of our paper.

Comparing Thm. 2 and 5, the convergence rate drops from $O(1/T)$ to $O(1/\sqrt{T})$ . Comparing convergence results of Alg. 1 to those of DecSPS, AdaSPS is somewhat unfair, as those are stochastic methods, and Alg. 1 is not! Thus, Table 1 is a misleading comparison.

In the revised manuscript, we will add the footnote in Table 1 and mention that the prior studies also established the convergence rates of DecSPS and AdaSPS in the stochastic setting. However, the convergence rates listed in Table 1 are in the deterministic setting, and the comparison of those is fair. Under the deterministic setting, Inexact Polyak Stepsize can converge to the optimal solution faster than DecSPS and AdaSPS because of $L_0 \ll L$ .

Having the $1/\sqrt{T}$ in the step size in Alg.1 has several disadvantages: one needs to determine the number of iterations before training, and for long runs the step size will become very small. A numerical analysis of how the choice of $T$ affects the performance of the method is missing.

Thank you for your suggestion. We conducted the experiments with Nano-GPT by varying the number of iterations $T$ and showed the results in Figures 5 and 6 in the attached PDF. The results indicate that Inexact Polyak Stepsize outperforms DoG, DecSPS, and AdaSPS for all $T \in \\{2500, 5000, 7500\\}$ . We have not finished experiments with SGD and Clipped SGD yet since we do not have sufficient time during this rebuttal. We will add these results along with those for SGD and Clipped SGD in the revised manuscript.

Algorithm 1 is a deterministic method (that is, it uses the full gradient). The comparison in section 6.2 suggests that you compare to stochastic methods (SGD,SPS,..). Which batch sizes were used for the methods? Did you use Algorithm 1 as is, or a stochastic version of it? [...]

We used the stochastic gradient with the same batch sizes for all methods, including our proposed method, in the results shown in Figures 2 and 3. Specifically, we set the batch size to $80$ , $64$ , and $128$ , for LSTM, Nano-GPT, and T5, respectively. For the results with the synthetic function shown in Figure 1, we used the full gradient for all methods. We will clarify these experimental settings in the revised manuscript.

Finally, the neural network results are not really convincing, as the proposed Inexact Polyak stepsize is only competitive for one out of three settings.

We would like to highlight that parameter-free methods should be featured with stability across different models and datasets, i.e., different problem-specific parameters $L$ , $L_0$ , etc., instead of achieving the best performance. For instance, as the reviewer mentioned, DoG achieved the best performance for LSTM among parameter-free methods. However, the training curve of DoG was very unstable for Nano-GPT, which is problematic for a parameter-free method. Therefore, we can conclude that Inexact Polyak Stepsize, which successfully trained all neural networks, is a desirable parameter-free method.

Many plots (e.g. Figure 2 and 3) are very hard to read, due to inadequate color palettes, too thin lines, missing log scales etc.

We will revise these figures to improve the readability.

To make the experiments reproducible, please specify for each hyperparameter and problem the actual value that was used, and not only the set of values among which it was tuned (see Tables 3-5).

Thank you for your suggestion. We will report the hyperparameters selected by grid search in the camera-ready version.

评论- Response to rebuttal

2024-08-09

Dear authors,

thank you for the clarifications, and the additional experiment.

I would not agree that Table 1 is a reasonable comparison, as the proofs for AdaSPS and DecSPS have been explicitly constructed for the (much harder) stochastic setting, while the analysis Inexact Polyak Step Size is restricted to the deterministic setting only.

Regarding the experiments, it seems to me that the (relative) performance of Inexact Polyak Step Size is very distinct for nanoGPT compared to T5 and LSTM. Do you have an intuitive explanation for this? For example, did you look at the effective step size of all methods?

2024-08-10

We appreciate your response.

I would not agree that Table 1 is a reasonable comparison, as the proofs for AdaSPS and DecSPS have been explicitly constructed for the (much harder) stochastic setting, while the analysis Inexact Polyak Step Size is restricted to the deterministic setting only.

As the reviewer mentioned, as the proofs for AdaSPS and DecSPS were constructed for the stochastic setting, there might be room to improve their convergence rates and reduce the dependence on $L$ , $\sigma$ , etc., in the deterministic setting. However, the comparison in Table 1 is consistent with the numerical results shown in Figure 1, and Figure 1 shows that Inexact Polyak Stepsize converges faster than DecSPS and AdaSPS. Thus, the comparison in Table 1 is fair in the deterministic setting, and we can conclude that Inexact Polyak Stepsize can converge faster than DecSPS and AdaSPS. In the following, we discuss the dependence of $L$ and $T$ in more detail.

Dependence on $T$ : DecSPS decreases the stepsize with rate $O(\frac{1}{\sqrt{T}})$ as $T$ increases. Thus, the convergence rate of DecSPS becomes $O(\frac{1}{\sqrt{T}})$ even in the deterministic setting. For AdaSPS, Figure 1 shows that DecSPS and AdaSPS converge at almost the same speed. This implies that the convergence rate of AdaSPS would be $O(\frac{1}{\sqrt{T}})$ in the deterministic setting.

Dependence on $L$ : Figure 1 shows that the convergence behavior of DecSPS and AdaSPS deteriorated as $L_1$ becomes large (i.e., $L$ becomes large), whereas the behavior of Inexact Polyak Stepsize was almost the same for all $L_1$ . Thus, these synthetic experiments show that the convergence rates of DecSPS and AdaSPS depend on $L$ .

From the above observation, the convergence rates of DecSPS and AdaSPS are inferred to be roughly $O(\frac{L}{\sqrt{T}})$ even in the deterministic case, which is slower than the convergence rate of Inexact Polyak Stepsize $O(\frac{L_0}{\sqrt{T}})$ shown in Theorem 5. Therefore, there is no much room to improve the convergence rates of DecSPS and AdaSPS even in the deterministic setting, and we believe that the comparison in Table 1 is reasonable.

Regarding the experiments, it seems to me that the (relative) performance of Inexact Polyak Step Size is very distinct for nanoGPT compared to T5 and LSTM. Do you have an intuitive explanation for this? For example, did you look at the effective step size of all methods?

In Figure 1, we appropriately tuned the hyperparameters. For clipped SGD and SGD, we carefully tuned the stepsize $\eta$ and clipping threshold $c$ by grid search over $\eta \in \\{1, 0.5, \cdots, 1.0 \times 10^{-4} \\}$ and $c \in \\{1, 2, \cdots, 10, \infty \\}$ (see Table 4.) The other comparison methods are parameter-free methods, which have no hyperparameters.

We have not gotten any intuition that explains why Inexact Polyak Stepsize performs so well for Nano-GPT. One potential reason why Inexact Polyak Stepsize outperformed clipped gradient descent with well-tuned hyperparameters is Inexact Polyak Stepsize can adaptively change the stepsize during the training, whereas clipped gradient descent cannot. However, this cannot explain why Inexact Polyak Stepsize works so well for Nano-GPT, compared with T5 and LSTM. There is still a gap between convergence analysis and the actual performance of training neural networks., and bridging this gap is one of the most important future research.

If the reviewer has any further concerns, we are happy to address them.

2024-08-14

We appreciate your efforts in examining our paper. The deadline for the rebuttal period is coming up soon. If the reviewer has any further questions, we are happy to resolve them.

评论- Thanks for reply

2024-08-14

Dear authors,

regarding Table 1 it seems that we have different viewpoints whether the comparison makes sense. Regarding intuition on nanoGPT performance, as you said you can currently not offer this - I would recommend to look at the effective step sizes.

Thus, I have no questions that need further discussion, but will maintain my score as my previously described concerns persist.

审稿意见

评分: 7置信度: 42024-07-13

This work made an interesting observation regarding the polyak stepsize. In particular, the main observation is that the polyak stepsize can be interpreted as doing a gradient clipping. With this observation, this work presents a new convergence guarantee of the polyak stepsize under the L_0,L_1 smoothness

优点

Interesting observation, and nice literature survey.
Theoretical results seem interesting.
Nice presentation and it's quite easy to follow.
Nice set of experiments.

缺点

I have several questions below, but overall, the main scope and results are well-presented.

问题

If you don't use the polyak step size, perhaps the parameter-free method should still achieve O(1/T) convergence rate? I'm asking this because the price to pay for parameter-freeness with the polyak stepsize seems to be quite significant. The convergence rate becomes O(1/\sqrt{T}) insteand O(1/T). Could you clarify this? Also, is there any lower bounds for this?
It's quite nice that the authors interpret the polyak stepsize as doing some kind of clipping. However, in practice, another popular method is coordinate-wise clipping. Does the polyak stepsize approach lets us do coordinate-wise clipping too?
Based on your experiments, when would you recommend practitioners to use the polyak stepsize instead of gradient clipping?
Perhaps, from a practical perspective, the fact that the algorithm needs to maintain the best iterate so far might be a little impractical. For NN training, computing f is too costly and you can only compute the mini-batch loss. This is just a minor point. How did you guys implement this for NN training experiments?

局限性

See above. Overall, I appreciate the clean presentation of this work. Based on the overall scope and the main results, my recommendation for this paper is a "weak accept."

作者回复

2024-08-07

We thank the reviewer for your positive evaluation and constructive feedback on our paper.

If you don't use the polyak step size, perhaps the parameter-free method should still achieve $O(1/T)$ convergence rate? [...]

We will add the discussion we replied to for all reviewers to the revised version and clarify this point.

Besides Polyak stepsize, DoG [2] is also a well-known parameter-free method for stepsize. [2] analyzed the convergence rate of DoG in the convex and deterministic setting, showing that DoG achieves the convergence rate $\tilde{\mathcal{O}} ( \frac{L D_T^2}{T})$ where $D_T \coloneqq \max_{0 \leq t \leq T} \| x_t - x^\star \|$ (see page 36 in [2]). Thus, it is still unclear that DoG converges to the optimal solution with rate $\mathcal{O}(\frac{1}{T})$ because $D_T$ may increase as the number of iterations $T$ increases. DoG is a parameter-free method for stepsize, whereas we study parameter-free methods for stepsize and clipping threshold, which is even more challenging than the parameter-free methods for stepsize.

We are not aware of the lower bound for parameter-free clipped gradient descent. It is one of the most promising directions for future research.

It's quite nice that the authors interpret the polyak stepsize as doing some kind of clipping. However, in practice, another popular method is coordinate-wise clipping. Does the polyak stepsize approach lets us do coordinate-wise clipping too?

Thank you for the comment. It is unclear whether Polyak stepsize approach can also be interpreted as coordinate-wise clipping as well as non-coordinate-wise clipping.

First of all, it is still an open question why coordinate-wise clipping works well in practice. [1] analyzed the convergence rate of non-coordinate-wise clipped gradient descent, showing that non-coordinate-wise gradient clipping allows gradient descent to converge faster under $(L_0, L_1)$ -smoothness. However, no prior papers have provided a similar convergence analysis for coordinate-wise clipping. Therefore, it is difficult to discuss the relationship between Polysk stepsize and coordinate-wise clipping at this moment. It is one of the promising directions to explain why coordinate-wise clipping works well from a theoretical perspective and investigate the relationship between coordinate-wise clipping and Polyak stepsize.

Based on your experiments, when would you recommend practitioners to use the polyak stepsize instead of gradient clipping?

Figure 4 indicates that Polyak stepsize with $f^\star=0$ is very unstable for LSTM and Nano-GPT. Polyak stepsize is very sensitive to the hyperparameter $f^\star$ , by comparing with the sensitivity of hyperparameters in clipped gradient descent (see Figure 3). Therefore, we do not recommend practitioners use Polyak stepsize instead of gradient clipping.

Perhaps, from a practical perspective, the fact that the algorithm needs to maintain the best iterate so far might be a little impractical. For NN training, computing f is too costly and you can only compute the mini-batch loss. This is just a minor point. How did you guys implement this for NN training experiments?

In our experiments with neural networks, we used the final parameters instead of the best parameters. Although Theorem 4 requires selecting the best parameters, we found that, in practice, it is unnecessary for neural networks because the norm of the stochastic gradient does not reach zero in practice (see Sec. 6.2.). We will revise the manuscript so that readers understand it more clearly.

Reference

[1] Koloskova et. al., Revisiting gradient clipping: Stochastic bias and tight convergence guarantees. In ICML 2023

[2] Ivgi et. al., DoG is SGD’s best friend: A parameter-free dynamic step size schedule. In ICML 2023

评论- Thanks

2024-08-09

Regarding the question "Based on your experiments, when would you recommend practitioners to use the polyak stepsize instead of gradient clipping?"

I meant your proposed method instead of just a vanilla polyak stepsize. When would you recommend using your proposed methods over gradient clipping?

2024-08-09

Thank you for your reply. Figures 3(a) and 3(c) show that Inexact Polyak Stepsize can outperform clipped gradient descent when the hyperparameters for clipped gradient descent are inappropriate. Thus, we can recommend using Inexact Polyak Stepsize when they do not have sufficient computational resources to search for the hyperparameters. However, the results indicate that Inexact Polyak Stepsize is not as good as clipped gradient descent with well-tuned hyperparameters. Therefore, if the practitioners have sufficient computational resources, we recommend using clipped gradient descent with well-tuned hyperparameters.

评论- Thanks!

2024-08-10

Thanks for answering my last question. I like the clean theoretical results, and the clear answers from your rebuttal.

I increase my score to 7. Good luck!

作者回复

2024-08-07

We thank all reviewers and AC for their efforts in reviewing our paper. We appreciate all comments and criticisms for improving our paper.

All reviewers point out that reducing $L$ to $L_0$ in Inexact Polyak Stepsize comes with the cost of slowing down the convergence rate to $\mathcal{O}(\frac{1}{\sqrt{T}})$ compared with the clipped gradient descent with appropriate hyperparameters (Theorem 2). However, we would like to emphasize that asymptotic independence of $L$ can also considerably improve the convergence rate because $L_0$ is thousands of times smaller than $L$ in practice [1]. Here, we would like to provide the summarized comment at this point. Detailed replies were provided to each reviewer separately.

Our paper studied parameter-free clipped gradient descent, and the main focus is to reduce the dependence of $L$ to $L_0$ . Since $L_0$ is much smaller than $L$ in practice [1], reducing this dependence can significantly enhance the convergence rate. To study the parameter-free clipped gradient descent, we analyzed Polyak stepsize under $(L_0, L_1)$ -smoothness, revealing that Polyak stepsize can also reduce the dependence on $L$ to $L_0$ as well as clipped gradient descent. Thanks to this discovery, we can reduce the number of hyperparameters to be considered from two (i.e., stepsize and clipping threshold) to one (i.e., $f^\star$ ). Then, we proposed a parameter-free method for Polyak stepsize, called Inexact Polyak Stepsize, without losing the asymptotic independence of $L$ .

Inexact Polyak Stepsize successfully achieves the asymptotic independence of $L$ and reduces the dependence of $L$ to $L_0$ , while it slows down the rate with respect to the number of iterations $T$ , as the reviewers pointed out. We agree with the reviewers that the convergence rate of Inexact Polyak Stepsize $O(\frac{L_0}{\sqrt{T}})$ is not optimal in terms of $T$ , and there may be room to improve this rate. However, we believe our discovery and the proposed method are the important first steps for developing parameter-free clipped gradient descent. Especially, uncovering that the convergence rates of clipped gradient descent and Polyak stepsize are the same allows us to reduce the number of hyperparameters to be considered from two to one, which will be pretty useful for future studies. We are grateful that all the reviewers found this discovery novel and interesting. We will carefully discuss the above point in the revised manuscript.

Again, we thank the reviewers for examining our paper. We would like to respond to any of your comments if you have any concerns about our feedback. We value your feedback and welcome discussion.

Reference

[1] Zhang et. al., Improved analysis of clipping algorithms for non-convex optimization. In NeurIPS 2020

最终决定Accept (poster)

2024-09-25

This is a solid paper, interesting and relevant to the community. The new inexact Polyak step size suffers from a 1/sqrt(T) rate but its convergence rate holds without knowledge of f*. We recommend acceptance as a poster.

To authors: Following the discussion with Reviewer bj1n, make sure the comparison in Table 1 is as fair and explicit as possible regarding stochastic vs. deterministic regime. One way would be to add a column "Regime" with values "Stochastic" or "Deterministic" depending on the analysis. Another way would be to delete Table 1 altogether. Please also improve the color palette in the plots, as the yellow color is really hard to read.