4.8

/10

withdrawn4 位审稿人

最低3最高8标准差2.0

4.3

置信度

ICLR 2024

Provable Benefit of Adaptivity in Adam

Bohan Wang,Yushun Zhang,Huishuai Zhang,Qi Meng,Ruoyu Sun,Zhi-Ming Ma,Tie-Yan Liu,Zhi-Quan Luo,Wei Chen

OpenReview PDF

提交: 2023-09-23更新: 2024-03-26

TL;DR

We show that Adam converges under non-uniform smoothness.

摘要

The Adaptive Moment Estimation (Adam) algorithm is widely adopted in practical applications due to its fast convergence. However, its theoretical analysis is still far from satisfactory. Existing convergence analyses for Adam rely on the bounded smoothness assumption, referred to as the L-smooth condition. Unfortunately, this assumption does not hold for many deep learning tasks. Moreover, we believe that this assumption obscures the true benefit of Adam, as the algorithm can adapt its update magnitude according to local smoothness. This important feature of Adam becomes irrelevant when assuming globally bounded smoothness. In this paper, we present the first convergence analysis of Adam without the bounded smoothness assumption. We demonstrate that Adam can maintain its convergence properties when smoothness is linearly bounded by the gradient norm, referred to as the $(L_0, L_1)$-smooth condition. Further, under the same setting, we refine the existing lower bound of SGD and show that SGD can be arbitrarily slower than Adam. To our knowledge, this is the first time that Adam and SGD are rigorously compared in the same setting where the advantage of Adam can be revealed. Our theoretical results shed new light on the advantage of Adam over SGD.

关键词

Adamconvergencenon-uniform smoothness

评审与讨论

审稿意见

评分: 3置信度: 42023-10-31

This paper provides the convergence analysis of Adam under ( $L_0,L_1$ )-smooth condition.

优点

Beyond the $L$ -smooth condition, the authors analyze the convergence of Adam under $L_0,L_1$ )-smooth condition. Their analysis yields a specific convergence rate (to small error).
Moroever, the authors also provide a function on which Adam works but SGD fails.

缺点

Several assumptions in this paper differ significantly from standard analysis. Notably, in Assumption 3.2, the constant $D_1$ should be dependent on the parameter $n$ , and $D_1\to\infty$ when $n\to\infty$ . According to the main result in Theorem 4.1, the RHS is on the order of $\sqrt{D_1}$ , which falis for large $n$ , a critical regime in stocastic optimization.
While the authors provide an example that Adam works but SGD fails, there may exist more examples under which Adam fails but SGD works. The results in this paper do not provide insights into the conditions under which Adam is likely to fail, as well as the efficiency of Adam in deep learning.
The title suggests a focus on the benefits of "Adaptivity," but this paper lacks a thorough explanation regarding these advantages.

问题

Please see "Weaknesses".

评论- Response

2023-11-21

审稿意见

评分: 8置信度: 42023-11-01

In this paper, the authors studied the theoretical properties of Adam optimizer. For the existing theoretical analysis, they rely on the bounded smoothness assumption (L-smooth condition), which is not practical in real-world machine learning models. Therefore, the authors proposed the very first convergence result of Adam without assuming the L-smoothness. The L-smoothness is replaced by a (L_0, L_1)-smooth condition, where the norm of Hessian matrix can be upper bounded by a linear combination of constant and gradient norm.

优点

The writing of this paper is very clear. The theoretical proofs are sound and solid. Although I didn't go through all the proof details in the appendix, I believe all these proofs are theoretically correct.
The reference list of this paper is also very complete.
The biggest strength is that the authors empirically verified the (L_0, L_1)-smooth condition under the case of LSTM and Transformers, which are both widely used model architecture.

缺点

The authors should pay more attention on the details of paper writing. For example, in the last line of Page 2, there is a reference error (Appendix ??).

问题

For me, there are no additional questions and suggestions. I think this paper worth an "accept" because of its theoretical soundness.

评论- Response

2023-11-21

审稿意见

评分: 5置信度: 42023-11-02

This paper studies convergence of the Adam algorithm. Previous work shows that Adam can converge to a bounded region around a stationary point under L-smooth condition. However, L-smoothness does not hold in practical settings (e.g., training of DNNs), so the main focus in this paper is to weaken the assumption to $(L_0, L_1)$ -smoothness. The authors show that Adam can converge to a bounded region even under this weaker assumption. Moreover, they also demonstrate the Adam's benefit over SGD by comparing the upper bound of Adam's convergence rate with the lower bound of the SGD's.

优点

They prove the convergence of Adam under the $(L_0, L_1)$ -smoothness condition, which is novel.
Benefits of Adam over SGD is explained via the derived convergence bound.

缺点

Theoretical contribution is a little minor. Basically, the result is an extension of [1]. Analysis under the $(L_0, L_1)$ -smooth assumption is a novel point, but technical novelty of their proof is limited.
They do not take the bias correction technique of Adam into consideration, which is widely used in practice. I think the result will not be so different even when the bias correction is incorporated, but they should mention it at least.
They mention that the first term on the right hand side of Eq. (4) is $\Theta ( 1 / (1 - \beta_2)^2 )$ . This means that the convergence of Adam deteriorates when $\beta_2$ is chosen to be close to $1$ , which seems to contradict to practical observations. They should add discussion on it.

[1] Zhang, Yushun, et al. "Adam can converge without any modification on update rules." Advances in Neural Information Processing Systems 35 (2022): 28386-28399.

问题

As I pointed out in the Weakness section, the first term on the right hand side of Eq. (4) is $\Theta ( 1 / (1 - \beta_2)^2 )$ , which means that the convergence of Adam deteriorates when $\beta_2$ is chosen to be close to $1$ , which seems to contradict to practical observations. How can the gap between theory and practice be explained?

评论- Response

2023-11-21

审稿意见

评分: 3置信度: 52023-11-06

This manuscript provides the convergence analysis for randomly shuffled Adam under the $(L_0,L_1)$ -smooth condition and stochastic gradient noise with affine variance. The authors also refine the existing lower bounds of SGD by following existing work [Zhang et al., 2019a].

优点

The paper studied an interesting and important problem. Understanding Adam in deep learning is very important.

缺点

One of the main theorems (Theorem 4.1) is very weak.

(i) It says that unless $D_0=0$ , the convergence rate of Adam is worse than SGD. One has to choose very large $\beta_2$ to converge to $\epsilon$ -stationary points. As the second term in RHS of formula (4) is of order $1/(1-\beta_2)^2$ , and one has to choose $(1-\beta)^2=\epsilon^2$ to make the second term of (4) on the RHS to be small, then $\beta_2$ has to be $1-\epsilon$ , and hence $T$ has to be $O(\epsilon^{-8})$ to make the first term of RHS of (4) to be less than $\epsilon^2$ . This is much worse than the complexity of SGD. In this setting, the SGD complexity indeed depends on the magnitude of gradient on a sublevel set, but this quantity can be bounded by smoothness constants and independent of $\epsilon$ . The $\epsilon$ dependency of SGD is still $\epsilon^{-4}$ . Therefore this contradicts the main claim of this paper about the "provable benefit of adaptivity in Adam".

(ii) The Theorem 4.1 requires $O(n\epsilon^{-8})$ gradient calls to find $\epsilon$ -stationary points, because the RR-Adam algorithm described in Algorithm 1 denotes $k=1,\ldots,T$ as $T$ as epochs, not $T$ iterations. The hidden $n$ dependency makes this paper extremely weak. For example, there is a simple baseline that significantly dominates RR-Adam analysis in this paper: one can run gradient descent with clipping as in [Zhang et al. 2019a] in a deterministic case (e.g., every epoch), then the number of gradient oracle calls is at most $O(n\epsilon^{-2}$ ) (Theorem 3 in Zhang et al. 2019a).

(iii) The provable benefit of Adam is proved in [Li et al. 2023] and they prove the $O(\epsilon^{-4})$ gradient complexity of Adam in the setting of stochastic optimization, which I believe is more general than the finite-sum setting as considered in this manuscript and they achieved a much better complexity results.

The presentation of the proof of Theorem 4.1 is very difficult to read. There are too many "constants" (e.g., from $C_1$ to $C_{13}$ on Page 18) which depend on each other. I suggest the authors to highlight which term is dominating and what are their orders.
The lower bound results are standard extensions of existing lower bounds in [Zhang et al. 2019a, Crawshaw et al. 2022]. As I saw from the proof, the hard instances are the same and the proof idea is also closely following these prior works. The only main difference is concatenating two 1-dimensional instances from previous works to a single 2-dimensional instance. I think this is a very marginal contribution.

Overall, although this paper studied an interesting problem, the upper bound results are too weak and dominated by previous works, and the lower bound results come from straightforward extensions from previous works. Therefore I recommend rejecting this paper.

问题

Can the authors address my questions? I am happy to increase my score if my concerns can be addressed.

伦理问题详情

N/A

评论- Response

2023-11-21

We express our appreciation to the reviewer for your constructive feedback. We will thoroughly revise our paper based on these valuable suggestions. However, due to the brevity of the rebuttal period, we are unable to complete this process in time and have decided to withdraw the paper. We would like to address a point of contention regarding the comparison of our work with that of [Li et al., 2023], which we believe to be both unfair and inappropriate.

Firstly, our draft was submitted to flagship conferences nearly a year before [Li et al., 2023] made their submission to arXiv, and our work was even cited by them. We believe that drawing a comparison under such circumstances is not correct.
Secondly, the analysis presented in our paper holds true when $\lambda = 0$ . Here, $\lambda$ represents a term added to $\sqrt{\nu_t}+\lambda$ to improve numerical stability and can be as minimal as $10^{-8}$ . Contrarily, Li et al., [2023] operates under the assumption that $\lambda>0$ , which creates a vastly different framework. This assumption allows the adaptive learning rate to be instantly upper-bounded, significantly simplifying the analysis. Conversely, in our proof, deriving the upper bound of $\sqrt{\nu_t}$ presents a central and complex challenge. The bounds provided by [Li et al., 2023] show a polynomial dependence on $1/\lambda$ , which could potentially result in an extremely loose bound. We acknowledge and appreciate the contribution of [Li et al., 2023] in providing an $O(1/\epsilon4)$ rate in the stochastic setting. However, due to the stark differences in setup, as discussed above, comparing our work with theirs is not even meaningful.