A Comprehensive Framework for Analyzing the Convergence of Adam: Bridging the Gap with SGD
A theoretical paper on the convergence of Adam
Abstract
Reviews and Discussion
The authors propose a new theoretical framework with fairly weak assumptions, within which they are able to establish convergence rates for Adam.
Questions for Authors
No question.
Claims and Evidence
N/A
Methods and Evaluation Criteria
No simulation study or application to real datasets.
Theoretical Claims
Given the short time available, it was not possible to fully and rigorously review all the proofs. However, based on what I was able to check, the proofs appear to be correct.
Experimental Design and Analysis
N/A
Supplementary Material
N/A
Relation to Existing Literature
The main contribution relies on a new set of weak assumptions to obtain theoretical results for the Adam algorithm. More precisely, the authors obtain convergence rates analogous to those in the literature, but under weaker conditions.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
Strengths: The authors successfully obtain results for Adam under remarkably weak assumptions (smoothness and ABC inequality). Given the short time available, it was not possible to fully and rigorously review all the proofs. However, based on what I was able to check, the proofs appear to be correct and well-detailed. I also appreciate the effort made to enhance the readability of the proofs, particularly through the use of the dependency graph.
Weaknesses: While it is now widely accepted that simulations are not strictly necessary to demonstrate that Adam works, it would have been valuable to present an application example that could not be theoretically addressed by previous works but can now be handled, before moving on to simulations.
Other Comments or Suggestions
No comments or suggestions.
Dear Reviewer LTUb,
We sincerely appreciate your thorough evaluation of our manuscript and your positive feedback. Your recognition of our theoretical framework and the establishment of convergence rates for Adam under weak assumptions is highly encouraging.
We acknowledge your suggestion to include an application example demonstrating the practical implications of our theoretical findings. While our primary focus has been on the theoretical aspects, we understand the value of illustrating how our results can address scenarios previously unmanageable by earlier works. In response, we plan to incorporate a relevant application example in our revised manuscript to highlight the practical applicability of our theoretical contributions.
Thank you once again for your insightful comments and for your recommendation to accept our work. Your feedback has been instrumental in enhancing the quality and impact of our manuscript.
Sincerely,
Authors of Paper 1314
The paper studies the convergence properties of Adam in smooth nonconvex settings. It presents both almost-sure and non-asymptotic convergence results under a relaxed noise assumption, namely the ABC inequality. The non-asymptotic convergence rate is generally consistent with that of SGD.
Questions for Authors
-
What is the dependence on the dimensionality $d$ and the smoothing factor in Theorem 3.1? Could you provide a formal version?
-
If you do have additional dimensionality dependence, is it possible to extend your results to some other smoothness settings as in [1,2,3], which can potentially remove the additional dependence and fill this gap between SGD and Adam?
-
Can your proof also extend to more general smoothness settings?
[1] Bernstein J, Wang Y X, Azizzadenesheli K, et al. signSGD: Compressed optimisation for non-convex problems. International Conference on Machine Learning. PMLR, 2018: 560-569.
[2] Liu Y, Pan R, Zhang T. AdaGrad under Anisotropic Smoothness. arXiv preprint arXiv:2406.15244, 2024.
[3] Xie S, Mohamadi M A, Li Z. Adam Exploits ℓ∞-geometry of Loss Landscape via Coordinate-wise Adaptivity. arXiv preprint arXiv:2410.08198, 2024.
Claims and Evidence
Most of the claims are generally clear.
- I do have a question about Theorem 3.1. As stated, the big-O notation in Theorem 3.1 appears to omit the dependence on the dimensionality $d$. I am wondering whether a dimension-free rate is actually possible under your assumptions, or whether you simply missed this dependence?
- Also regarding Theorem 3.1, could you please also specify the dependence on the other hyperparameters, such as $\beta_1$ and $\beta_2$?
Methods and Evaluation Criteria
No experiments in the paper.
Theoretical Claims
I didn't check the whole proof for the theoretical claims due to its complexity. Basically the results are reasonable.
Experimental Design and Analysis
No experiments in the paper.
Supplementary Material
No.
Relation to Existing Literature
This paper moves the convergence results for Adam, which is popular in the literature, a step forward.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
Strengths:
- The paper is technically solid, making progress toward better convergence results for Adam under more relaxed settings.
Weakness:
- I don't think hiding so many details of the convergence result is appropriate. As I mentioned in the Claims and Evidence part, I think it is quite likely that the authors hide the dependence on the dimensionality $d$ (which can be very large in practice). The dependence on $\beta_1$, $\beta_2$, and other parameters is also important, since it can reveal the role of momentum in Adam and suggest possible parameter choices. Thus I think the authors should definitely give a formal statement of Theorem 3.1, at least in the appendix.
- The admissible choices of the hyperparameters seem somewhat restricted.
Other Comments or Suggestions
-
For Theorem 3.1, it seems somewhat odd to state the convergence result in high-probability form, since it has a polynomial ($1/\delta$-type) dependence on the failure probability $\delta$, while standard high-probability convergence results usually depend on $\delta$ only through a $\log(1/\delta)$ factor (see the note after this list). I think just equation (42) is good enough for the statement.
-
I suggest the authors allocate some space, at least in the appendix, to aggregate the definitions of the introduced quantities. It is really hard to follow the proof, or even to find the detailed statements of the theorems, when many defined quantities have widely separated definitions.
-
Why are you not using the version of the template with line numbers?
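As context for the first point above: a $1/\delta$-type dependence is what one typically gets when converting an in-expectation bound into a high-probability statement via Markov's inequality. For a nonnegative error quantity $X$,
$$\mathbb{E}[X] \le B \;\Longrightarrow\; \mathbb{P}\!\left(X \ge \tfrac{B}{\delta}\right) \le \delta,$$
so with probability at least $1-\delta$ one only obtains $X \le B/\delta$, whereas concentration-based high-probability bounds typically pay only a $\log(1/\delta)$ factor.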
Rebuttal to Reviewer nypN
Dear Reviewer nypN,
Thank you very much for your thoughtful feedback and constructive comments on our manuscript. We sincerely appreciate the time and effort you have put into reviewing our work. We are grateful for your insights, and we have carefully addressed each of your points below in the hope of clarifying our contributions and improving the manuscript.
1. Dependence on the Dimensionality and the Smoothing Factor in Theorem 3.1
You raised a question regarding the dependence on the dimensionality $d$ and the smoothing factor in Theorem 3.1, and asked for a formal version of the sample complexity result in the theorem.
Response: The sample complexity result in Theorem 3.1 depends explicitly on the dimension $d$. This is consistent with previous works on the convergence of Adam, such as [1]. We would like to emphasize that while it is possible to remove the dependence on the dimension $d$, we cannot avoid reintroducing a dependence on the inverse of the smoothing factor. This is a well-known consensus in previous studies [2]. We hope this clarification addresses your concern.
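For reference, and to make the roles of the smoothing factor and the momentum parameters concrete, below is a minimal sketch of the standard Adam update (Kingma and Ba, 2015). The exact variant analyzed in our paper may differ in details such as bias correction and step-size scheduling, so this is only illustrative.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam step (with bias correction).

    eps is the smoothing factor in the denominator; beta1 controls the
    first-moment (momentum) average and beta2 the second-moment average.
    """
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment estimate
    m_hat = m / (1 - beta1**t)                # bias corrections
    v_hat = v / (1 - beta2**t)
    # A large eps dominates sqrt(v_hat) and pushes the step toward SGD with momentum.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    w, m, v = adam_step(w, grad=w, m=m, v=v, t=t, lr=0.05)
print(w)  # oscillates within O(lr) of the minimizer at 0
```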
2. Extension of Results to Other Smoothness Settings
You asked whether our results could be extended to other smoothness settings, as in [3, 4, 5], and whether this could potentially remove the additional dependence on dimensionality, helping to bridge the gap between SGD and Adam.
Response: We believe it is very likely that our results can be extended to other smoothness settings, and we are excited about exploring this in future work. However, we must admit that this was not covered in our previous research, so we cannot provide a definitive answer at this moment. Nonetheless, we plan to address this issue in future research, where we will explore the possibility of extending our results to more general smoothness assumptions and examine whether the additional dependence on dimensionality can be removed.
3. More General Smoothness Cases
You inquired whether our proof can extend to more general smoothness cases.
Response: We are currently investigating this direction and have made some progress. Specifically, we have made the following two extensions so far:
-
For the first class of generalized smooth functions: we can derive convergence results for Adam under the traditional second-moment-based ABC inequality. The sample complexity still depends on the inverse of the smoothing factor, though we can eliminate the dependence on the dimension $d$.
-
For the second class of generalized smooth functions: at this stage, our methods are unable to extend the convergence results under the traditional second-moment-based ABC inequality. However, we can obtain convergence results under the traditional second-moment bounded-variance condition. It is worth noting that, to our knowledge, there are currently no results for Adam's convergence under this smoothness setting with the traditional second-moment bounded-variance condition.
We are continuing to investigate these extensions and will include them in future work. We greatly appreciate your interest in this aspect and will strive to address it in subsequent studies.
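As a point of reference for what "more general smoothness" can mean in this literature (and without claiming that these are exactly the classes handled by the extensions above), one commonly studied condition is $(L_0, L_1)$-smoothness: for twice-differentiable $f$,
$$\|\nabla^2 f(w)\| \le L_0 + L_1\,\|\nabla f(w)\|,$$
which recovers standard $L$-smoothness when $L_1 = 0$ and allows the local curvature to grow with the gradient norm.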
4. Additional Comments and Acknowledgements
We sincerely appreciate your review, which has helped us identify areas for clarification and potential improvement. We value your suggestions and will ensure that these aspects are thoroughly explored in our future research. Your feedback has significantly contributed to the refinement of our manuscript, and we hope that the revisions we have made have addressed your concerns effectively.
If you have any further questions or suggestions, please do not hesitate to reach out. We are more than happy to discuss any aspects of our work in greater detail. Once again, thank you for your careful review and constructive feedback.
We look forward to your final assessment.
[1] Bohan Wang, Jingwen Fu, Huishuai Zhang, Nanning Zheng, and Wei Chen. Closing the gap between the upper bound and lower bound of Adam's iteration complexity. Advances in Neural Information Processing Systems, 36, 2024a.
[2] Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of Adam under relaxed assumptions. Advances in Neural Information Processing Systems, 36, 2024.
[3] Bernstein J, Wang Y X, Azizzadenesheli K, et al. signSGD: Compressed optimisation for non-convex problems. International Conference on Machine Learning. PMLR, 2018: 560-569.
[4] Liu Y, Pan R, Zhang T. AdaGrad under Anisotropic Smoothness. arXiv preprint arXiv:2406.15244, 2024.
[5] Xie S, Mohamadi M A, Li Z. Adam Exploits ℓ∞-geometry of Loss Landscape via Coordinate-wise Adaptivity. arXiv preprint arXiv:2410.08198, 2024.
Sincerely,
Authors of Paper 1314
Thanks for the detailed reply, which basically addressed my questions. It's good to hear that the extensions are generally possible. I understand the authors' point on the dependence on the dimension $d$ and the momentum factor $\beta_1$, but I still want to emphasize why I think these should be explicitly shown in the statements here.
- Adam is widely used in large-scale experiments, which means that $d$ can be extremely large. The explicit dependence on $d$ suggests that the convergence rate is actually not desirable for large-scale experiments. I understand that previous results do have the additional dependence on $d$ as well, but I disagree with what you refer to as a "well-known consensus" from [2]. I don't think they have a proof for your claim, i.e., that one has to bear this additional explicit dependence on $d$ or on the inverse smoothing factor for Adam. If you introduce the inverse smoothing factor into the convergence rate, it intuitively encourages us to select a large smoothing factor, and the algorithm then becomes more similar to SGD. If you think this is the only way to eliminate the explicit dependence on $d$, then why don't we directly use SGD? Why is Adam so popular in practice?
- The parameter $\beta_1$ controls the incorporation of momentum. Since your result depends on $\beta_1$, it seems that choosing $\beta_1 = 0$ basically gives the best rate. This is not the case in practice, right?
Anyway, I agree with the authors' contribution on the technical side and fully understand that these points are not considered by some existing work as well, but I still want to emphasize these points as somehow remaining problems of the results that might be improved in the future. For now, I think the paper is qualified, and I would keep my score since it's already positive.
Dear Reviewer nypN,
Thank you very much for your positive feedback and detailed comments. We appreciate your support and valuable suggestions, which are very helpful for improving our work.
Best regards,
Authors of Paper 1314
In the past several years, many efforts have been made to understand the convergence of Adam-like algorithms under different noise assumptions. This paper is a novel contribution among these works and is based on an even weaker version of the noise condition, called the ABC condition. Under the ABC assumption, the authors provide a non-asymptotic sample complexity rate that is independent of the smoothing factor. Additionally, they demonstrate asymptotic convergence of the gradient norm to zero, both almost surely and in expectation. These results match the best known rates under a weaker condition and advance the theoretical understanding of adaptive methods.
Questions for Authors
No further questions.
Claims and Evidence
Not applicable.
Methods and Evaluation Criteria
Not applicable.
Theoretical Claims
The major theoretical conclusions are reasonable, and the key steps in the proofs appear to be correct.
Experimental Design and Analysis
Not applicable.
Supplementary Material
I have briefly checked the proof in the appendix. I am not sure about the proof details but they appear to be convincing.
Relation to Existing Literature
This work is entirely theoretical and does not present any negative broader scientific or societal impacts. The relationship to closely related work is discussed in the "weaknesses" part below.
Essential References Not Discussed
The authors appropriately cite the most relevant prior work and provide a clear and detailed discussion of how their contributions relate to and advance the existing literature.
Other Strengths and Weaknesses
Strength
This paper is generally well written and easy to follow. It gives a clear comparison of the assumptions with those of closely related work, making it easy for the reader to understand. From a contribution perspective, compared with prior results, this paper achieves a nearly matching rate and establishes asymptotic convergence for Adam under a weaker ABC condition by leveraging advanced tools from functional analysis. This is valuable to the optimization community.
Weakness
However, the novelty of this work is somewhat questionable. While the results in this paper indeed rely on a weaker assumption compared to prior work, the gap between the ABC condition and the affine variance condition (or its variants) is not substantial. As a result, the findings, though technically sound, are not entirely surprising. It would be more helpful if the authors could provide more convincing arguments showing that their technique is indeed novel relative to existing work, particularly [Hong and Lin, 2024]. I will change my score if they clarify how their approach is fundamentally different from existing methods.
Other Comments or Suggestions
Not applicable.
Dear Reviewer JhWZ,
Thank you for your thoughtful and constructive feedback on our paper. We greatly appreciate the time and effort you’ve put into reviewing our work. We carefully considered your comments, and we would like to address the main concern regarding the differences between our approach and the paper by Hong and Lin (2024), as well as clarify the analytical techniques and assumptions used in our paper.
In the paper by Hong and Lin (2024), the authors introduce an affine noise variance assumption that differs from the traditional second-moment-based affine variance assumption. Instead, they strengthen the concentration property of the stochastic gradients by assuming the following (their Assumption A.3):
$$\mathbb{E}\left[\exp\left\{\frac{\|g_{t}-\nabla f(w_{t})\|^{2}}{B\|\nabla f(w_{t})\|^{2+\epsilon}+C}\right\}\,\middle|\, F_{t-1}\right] \leq e.$$
Their proof is heavily dependent on this condition, which makes their approach inapplicable for analyzing the traditional second-moment-based affine variance noise conditions. For a detailed explanation, please refer to their open-access paper: Hong and Lin, 2024.
Hong and Lin's proof relies extensively on their Lemma B.6, which is proven using concentration inequalities (see their Appendix D.2). Under their exponential-tailed affine variance noise condition, these inequalities yield a favorable order for the corresponding sample complexity term. However, under traditional affine variance noise conditions based solely on second-moment assumptions, only a significantly worse order can be derived for this term, which may not yield the desired result.
Our paper employs the ABC inequality, which is even weaker than traditional affine variance noise conditions. This necessitates fundamentally different analytical methods, in particular ones based on discrete martingale analysis, which distinguishes our approach from that of Hong and Lin (2024).
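For concreteness, the ABC inequality we work with is of the following commonly used form (the exact constants and normalization in our assumption may differ slightly):
$$\mathbb{E}\left[\|g_t\|^2 \,\middle|\, F_{t-1}\right] \le 2A\bigl(f(w_t) - f^{\star}\bigr) + B\,\|\nabla f(w_t)\|^2 + C,$$
whereas the traditional second-moment-based affine variance noise condition reads
$$\mathbb{E}\left[\|g_t - \nabla f(w_t)\|^2 \,\middle|\, F_{t-1}\right] \le B_0\,\|\nabla f(w_t)\|^2 + C_0.$$
The affine variance condition implies the ABC inequality with $A = 0$, $B = B_0 + 1$, and $C = C_0$, so the ABC inequality is the weaker requirement; it also does not require the exponential-tail control of Hong and Lin's Assumption A.3.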
Thank you again for your time and thoughtful evaluation of our work. If you have any further questions, please don't hesitate to discuss them with us.
Sincerely,
Authors of Paper 1314
This paper presents a unified analytical framework for understanding Adam’s convergence under weaker assumptions than those typically used. Specifically, the authors rely on standard L-smoothness and ABC inequality for stochastic gradients to show that Adam achieves non-asymptotic and asymptotic convergence.
Questions for Authors
In my understanding, the behavior of Adam is quite different from that of SGD. How do the authors bridge the gap with SGD in their paper? Only in the part establishing the descent inequality and in the final results?
Claims and Evidence
The main claim of this paper is the convergence of Adam with not very strong conditions. The claims are supported by rigorous math proofs.
Methods and Evaluation Criteria
This is a theoretical paper, so this question is not applicable.
Theoretical Claims
I followed the proof sketch and checked some proofs of the main lemmas; they seem to be correct. However, the authors should discuss the assumptions in more detail, especially the ABC inequality, since it is not standard in this type of analysis. The authors could highlight how this assumption is applied in the proof and briefly explain why the previously used stronger assumptions are not needed. This would provide more theoretical insight.
Experimental Design and Analysis
This is a theoretical paper, so this question is not applicable.
Supplementary Material
The appendix is well organized. Section A of the appendix compares the gradient assumption with previous ones. However, since the relatively weak assumption used in this paper is a main point of difference from previous work, there should be a short paragraph discussing this difference in the main body of the paper rather than only in the appendix. The authors could also discuss parallel results in SGD analysis that use similar assumptions.
Relation to Existing Literature
Understanding Adam is important since it achieves great success in LLM training.
Essential References Not Discussed
The authors do cite key references on Adam’s analysis under different assumptions.
Other Strengths and Weaknesses
This is a solid theoretical paper analyzing Adam, and it achieves good results in both non-asymptotic and asymptotic settings. My concern is mainly about the presentation of the results and has already been pointed out in the previous questions. One further concern is that requiring the hyperparameter $\beta_2$ to converge to 1 looks unnatural to me.
Other Comments or Suggestions
No other comments.
Dear Reviewer,
We sincerely appreciate your thorough evaluation of our manuscript and your insightful feedback. Your recognition of our theoretical framework is highly encouraging.
Incorporation of Assumption Comparisons into the Main Text:
We acknowledge your suggestion to move the discussion comparing our assumptions, particularly the ABC inequality, with previous ones from the appendix to the main body of the paper. We agree that this adjustment will enhance the clarity and accessibility of our work. In the revised manuscript, we will integrate this discussion into the main text, providing a concise comparison and highlighting the theoretical insights gained from using the ABC inequality. Additionally, we will discuss parallel results in stochastic gradient descent (SGD) analyses that employ similar assumptions to further contextualize our contributions.
Clarification on the Behavior of Adam When $\beta_2$ Does Not Approach 1:
Regarding your concern about the requirement that the hyperparameter $\beta_2$ converge to 1, we appreciate the opportunity to clarify this point. In scenarios where $\beta_2$ does not approach 1, Adam's convergence behavior differs: under such conditions, Adam may only ensure that the gradient converges to a small neighborhood around zero rather than exactly to zero. To achieve convergence of the gradient to zero, it is necessary for $\beta_2$ to approach 1. This requirement has been highlighted in previous studies, such as the work by [1] on the convergence of Adam. In our current paper, we focused on aligning Adam's convergence results with those of SGD, which led us to adopt the condition that $\beta_2$ approaches 1. We acknowledge that this aspect was not explicitly discussed in our manuscript, and we will address this omission in future research by exploring scenarios where $\beta_2$ does not approach 1.
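Schematically, writing $\Delta(\beta_2)$ for a generic problem-dependent neighborhood radius (introduced here only for illustration, not as notation from our paper), the situation for fixed $\beta_2 < 1$ described above can be summarized as
$$\limsup_{t \to \infty} \|\nabla f(w_t)\| \le \Delta(\beta_2), \qquad \Delta(\beta_2) \to 0 \ \text{as}\ \beta_2 \to 1,$$
which is why we let $\beta_2$ approach 1 in order to obtain exact convergence of the gradient to zero, matching SGD-type guarantees.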
Thank you once again for your valuable comments and suggestions. Your feedback has been instrumental in improving the clarity and depth of our work.
[1] Zhang Y, Chen C, Shi N, et al. Adam can converge without any modification on update rules[J]. Advances in neural information processing systems, 2022, 35: 28386-28399.
Sincerely,
Authors of Paper 1314
Thanks for the detailed reply from the authors, which resolves my concerns. I will keep my score.
This paper provides a comprehensive framework for analyzing the convergence of Adam. The reviewers are generally positive about the paper after the rebuttal. Therefore, I recommend acceptance. The authors are required to incorporate the following points raised by reviewers in the final version of the paper: (i) explicitly mention the dependency on dimension in the theorem statement and the paper; (ii) explain the ABC inequality and compare with other assumptions in the literature; (iii) explicitly acknowledge that the paper does not show a separation between Adam and SGD.