Adam with model exponential moving average is effective for nonconvex optimization
Abstract
Reviews and Discussion
In this paper, the authors analyze the online-learning framework of 'discounted-to-nonconvex conversion' and prove a bound on the expectation of the exponential average of the output gradients. Moreover, they find that if the online learner follows a specific loss structure (FTRL), then its output follows clipped-Adam-style update rules.
Strengths
The result of this paper is solid.
Weaknesses
Since this paper is built upon the 'discounted-to-nonconvex conversion' framework proposed in [1], focusing on the perspective of online learning, many definitions, concepts, and details require further elaboration and interpretation, especially for readers who are unfamiliar with this topic. I will list all of my confusions and questions in the 'Questions' part and update my rating according to the authors' response.
[1] Qinzi Zhang and Ashok Cutkosky. Random scaling and momentum for non-smooth non-convex optimization. In International Conference on Machine Learning. PMLR, 2024.
Questions
-
In line 109, in the description of Algorithm 1: could the authors explain the reason they introduce the notations and ? From my perspective, represents the iterations of the coefficients of the loss function, so the stochastic gradient should also be computed at , instead of . I also wonder about the implication of in the definition of , since it seems to serve only the later proofs.
-
I might be wrong, but there might be typos in the formula of Definition 6. According to the definition of in Algorithm 1, the RHS of the formula should perhaps be . Meanwhile, I also wonder whether is actually $\mathbf{g}_s$ as defined in Algorithm 1, since I never found the definition of . Moreover, could the authors explain the meaning and implications of such a definition?
-
Could the authors explain the meaning of choosing such a comparator in Lemma 7?
-
I might be wrong, but when I checked the Lagrangian function of the constrained optimization in line 137, it seems the RHS should be .
-
Could the authors explain the details of how they obtain Corollary 12 from the results in Theorem 11?
-
Could the authors explain why line 466 holds? I am not sure about the independence between and . Moreover, could the authors clarify the sources of randomness in their formulas and which variables the expectations are taken with respect to during the proof?
-
Could the authors further discuss the implications of their main result? From my perspective, it looks like they use the exponentially discounted loss and choose an FTRL learner in the discounted-to-nonconvex conversion, and thereby obtain update rules that share a similar structure with Adam. However, this connection feels far-fetched to me, and I do not see what it implies.
Limitations
No limitations.
Thank you for your detailed comments! We answer your questions one by one below.
-
You are correct that is computed at in practice. In our new version, we actually fix this issue, and in our new main results, is computed at , and there is no need to introduce anymore. This is based on the "exponential scaling" technique due to [Zhang and Cutkosky, 2024], and we will update this in our final version.
-
The role of is for the identity to hold, which is crucial for the discounted-to-nonconvex conversion to work. When is convex, this holds without the random scaling , but it turns out that in the nonconvex setting we need this random scaling. In our final version, where we fix the aforementioned issue regarding , we choose to be an exponential random variable, inspired by [Zhang and Cutkosky, 2024].
-
Regarding being beneficial for the later proofs: sorry, this is an overload of notation; the 's that appear in the later part of the analysis should have been a separate quantity. We have changed the 's in the later part to in our final version.
-
Thank you for catching the typos. It should indeed be . We have corrected this in our final version.
-
The comparator roughly models the "update direction an oracle algorithm with perfect knowledge of the loss would have chosen." In the proof of Lemma 7, we show that along such a direction the loss value decreases, and this is how we obtain the convergence guarantee.
-
You are correct. Thanks for catching the typo.
-
Let's start with in Theorem 11. Since and , it follows that , as desired.
-
Thank you for your question. In our theorem statements, the expectations are taken over both the randomness in the gradient oracle and the internal randomness of the algorithm from choosing . Note that the randomness in the stochastic gradient oracle is independent of the randomness due to . Since , it follows that , where the inner expectation is with respect to the randomness in the stochastic gradient oracle and the outer one is with respect to all other quantities. We will detail all of the arguments regarding expectations in our final version.
-
As discussed in the introduction section, Adam with EMA is very successful for large, complex neural network models. Our main goal was to understand why such a method is so effective. Although we do make a minor modification to the original Adam (namely, clipping), our analysis shows that Adam with EMA achieves an optimal convergence rate, and it is particularly beneficial when the Lipschitzness constants are unknown and heterogeneous across coordinates. We believe our results represent significant progress towards understanding the success of Adam (with EMA), since before our work it was unclear from a theoretical standpoint why Adam is much more effective than SGD.
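For concreteness, below is a rough schematic of the kind of method we have in mind (plain NumPy; the constants, the specific clip rule, and the hyperparameter names are illustrative stand-ins rather than the exact algorithm or constants analyzed in the paper):

```python
import numpy as np

def clipped_adam_with_ema(grad_fn, x0, steps, lr=1e-3, beta1=0.9, beta2=0.999,
                          eps=1e-8, ema_decay=0.999, clip=1.0):
    """Schematic only: an Adam-shaped update with a per-coordinate clip on the
    update direction and an exponential moving average (EMA) of the iterates."""
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)          # EMA of gradients (first moment)
    v = np.zeros_like(x)          # EMA of squared gradients (second moment)
    x_ema = x.copy()              # model EMA, returned at the end
    for _ in range(steps):
        g = grad_fn(x)                               # stochastic gradient at x
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        direction = m / (np.sqrt(v) + eps)           # Adam-shaped direction
        direction = np.clip(direction, -clip, clip)  # illustrative clip rule
        x = x - lr * direction
        x_ema = ema_decay * x_ema + (1 - ema_decay) * x
    return x_ema
```

The point of the schematic is simply that the object whose convergence is analyzed is the EMA iterate rather than the last iterate.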
We would be glad to address any remaining concern!
I thank the authors for their response. I still have some unresolved questions. Since the authors claim they have dropped the notation , which plays a significant role in their original proof, I cannot check the correctness of the results in their new version. Beyond this significant problem, I am also confused by their explanations of the following points:
- Since they drop the variable , why do they still need for the further proof? At least in the original version, was introduced in the definition of , and now I don't even know where it is first introduced.
- The authors might have misunderstood my concern about line 466. Actually, I want to know why .
- The authors' explanation of the implications and motivations does not convince me; several definitions and settings seem too far-fetched and specific. While the authors consider clipping a minor modification, I have never used it or seen it used by others in practice. Similar questions also exist for the definitions of the comparator and the learning rate . As I stated in my review, it seems that only given such specific forms do the authors obtain update rules that share a similar structure with Adam. However, I cannot see any direct connection between these frameworks and Adam.
Due to these questions, I have to keep my original rating score. However, I want to stress that I am not familiar with online optimization; in particular, my comments about these intermediate variables might be unfair, and the authors are welcome to point this out with more convincing evidence. Moreover, I keep my confidence at 2, and if other reviewers feel more confident in their judgment, I would defer to them.
-
To clarify the new scheme, the update is now , where is sampled from an exponential distribution (which is a very light-tailed distribution with mean 1 and variance 1). So is still used.
Via an elementary argument involving the fundamental theorem of calculus and the exponential pdf , we obtain .
This identity says that, in expectation, the function gap is exactly equal to its linearization. In a sense, the randomization makes the first-order Taylor approximation perfectly correct. In the current submission, we sample at to ensure this identity. However, changing to be exponentially distributed and multiplying the update by means that this is no longer necessary.
In short, although plays a significant role, the entirety of that role is contained in establishing the identity . This alternative keeps the same identity and so works just as well.
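Written out in generic notation ($x$ for the current point, $\Delta$ for the update, $s$ for the random scaling; the paper's own symbols may differ, and we assume enough regularity to differentiate along the segment), the kind of identity we are referring to is

$$
\mathbb{E}_{s\sim\mathrm{Exp}(1)}\big[F(x+s\Delta)\big]-F(x)
=\int_0^\infty e^{-s}\big(F(x+s\Delta)-F(x)\big)\,ds
=\int_0^\infty e^{-s}\,\langle\nabla F(x+s\Delta),\,\Delta\rangle\,ds
=\mathbb{E}_{s\sim\mathrm{Exp}(1)}\big[\langle\nabla F(x+s\Delta),\,\Delta\rangle\big],
$$

where the middle equality is integration by parts using $\tfrac{d}{ds}F(x+s\Delta)=\langle\nabla F(x+s\Delta),\Delta\rangle$, and the boundary terms vanish because the gap is zero at $s=0$ while $e^{-s}$ dominates the at-most-linear (by Lipschitzness) growth of the gap as $s\to\infty$.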
That said, if the reviewers prefer to keep the results the same as in the submitted version for which all of the analysis is completely available, we are of course happy to do so.
-
Regarding why : this is because is a standard stochastic gradient oracle evaluated at , and as such . As a concrete illustration, suppose that , where is the model parameter, is a randomly sampled data point, and is the loss of the model on that data point. Then can be taken to be for some i.i.d. sample . This formalism exactly captures the standard approach used in training. In this case, since is independent of and , we have .
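As a minimal numerical illustration of this unbiasedness (a toy least-squares objective; all names here are illustrative and not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5_000, 5
Z = rng.normal(size=(n, d))               # the dataset of "data points"
w_star = rng.normal(size=d)
y = Z @ w_star + 0.1 * rng.normal(size=n)

def full_grad(x):
    # gradient of F(x) = (1/2n) * sum_i (z_i^T x - y_i)^2
    return Z.T @ (Z @ x - y) / n

def stoch_grad(x):
    # gradient of the loss on one uniformly sampled data point;
    # its conditional expectation given x equals full_grad(x)
    i = rng.integers(n)
    return Z[i] * (Z[i] @ x - y[i])

x = rng.normal(size=d)
avg = np.mean([stoch_grad(x) for _ in range(100_000)], axis=0)
print(np.linalg.norm(avg - full_grad(x)))  # small Monte Carlo error
```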
-
Regarding the clipping: You're right that this is not too common in current practice. However, we feel this is more an analytical artifact than a real change to the algorithm, because the clipping should not be active in most iterations. The reason is that, intuitively, as the algorithm converges we expect the gradients to be dominated by noise, so that they look roughly like mean-zero random variables. In this case, the clipping will not occur. On the other hand, if the clipping does occur frequently, then an alternative analysis based upon https://jmlr.org/papers/v18/17-079.html would actually show that we converge incredibly fast. This in turn would put us in the regime in which gradients are dominated by noise, and so clipping would stop occurring.
The authors show that, with clipping and a model exponential moving average, (clipped) Adam can perform better than SGD.
Strengths
Gives an analysis of (clipped) Adam with a model exponential moving average.
Weaknesses
-
The paper is hard to follow.
(i) In Algorithm 1, it would be better to add "output: "
(ii) It is unclear what is in Definition 6.
-
All of the following results are based on Lemma 7; I am not sure whether Lemma 7 is suitable for SGD.
Questions
If Lemma 7 is replaced with a different formulation, will the conclusion change?
Limitations
N/A
Thanks for your comments. Please see the other responses for some clarifications to the mathematics that we will include in the final paper. We hope this will make the paper easier to follow.
Regarding Lemma 7: it is indeed an important part of our analysis, and as we show, it allows us to analyze Adam; in fact, it arguably allows us to derive Adam from first principles. A similar argument using a less advanced online learning algorithm could be made to analyze SGD with momentum instead.
Thank you for the response I have raised my score.
This paper proposes a variant of Adam involving per-iteration clipping of the updates (which look like Adam's updates) and EMA-based weight averaging. More specifically, two versions of this variant are proposed: a global (w.r.t. all the coordinates) version and a per-coordinate version. The two main technical tools used in developing the proposed algorithm are (i) the online-to-nonconvex framework of Cutkosky et al. [2023] and Zhang and Cutkosky [2024], wherein the update directions are chosen by an online learner, and (ii) choosing the online learner to be scale-free Follow-the-Regularized-Leader (FTRL), based on Ahn et al. [2024b]. The proposed algorithm attains the optimal convergence rate for both smooth and non-smooth nonconvex settings (although w.r.t. a different notion of stationary points in the non-smooth case). In addition, the benefit of the per-coordinate version is discussed.
Strengths
1. The proposed algorithm is shown to attain the optimal convergence rate for both smooth and non-smooth non-convex problems.
2. I like the discussion on the benefit of coordinate-wise updates.
3. The results of this paper also shed some light on why EMA-based weight averaging can be useful in practice (although it doesn't seem strictly important in the context of the proposed algorithm; see Weaknesses).
Weaknesses
1. It appears to me that this paper's main convergence results (Theorems 11 and 19) can hold even with as defined in Lemma 7. Since Theorems 11 and 19 are not w.r.t. the last averaged iterate (which is what is used in practice), it seems that EMA-based averaging is not strictly necessary in the context of the proposed algorithms, i.e., the results hold even with the unaveraged iterates appropriately sampled (as described in Lemma 7).
2. I understand that this is a theoretical paper but because a new algorithm has been proposed involving clipping and EMA-based averaging, it would have been nice to show some empirical results at least in some simple settings. This would have made the proposed algorithm more appealing and convincing to practitioners.
Questions
1. The purpose of Lemma 10 is not immediately clear to me. Is it to replace in Lemma 7 with ?
2. It seems to me that clipping is always activated in -FTRL and CLIPPED-ADAM(?) This is because with the choice of and in CLIPPED-ADAM for instance, .
3. Is the Lipschitz assumption necessary?
Limitations
Discussed.
Thank you for your constructive comments! Let us address your questions one by one.
-
(Weakness 1 / Question 1) We indeed need to choose as the output, and this is crucial to achieving -stationarity, our main notion of convergence. This notion basically requires outputting a point such that both and are small. In other words, we need to find a point for which one can find a set of points around it whose averaged gradient is small. In light of this, the roles of the two lemmas are as follows (see the generic sketch after the two items below):
- Lemma 7 gives the set of points that guarantees the smallness of , and hence we output its expectation .
- Lemma 10 ensures that is small; in other words, it ensures that is closely clustered around its average .
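Schematically, in generic symbols ($\bar{x}$ for the output, $y_1,\dots,y_k$ for the nearby points, $\delta,\epsilon$ for the accuracy parameters; the paper's exact definition may differ), the target notion is

$$
\Big\|\tfrac{1}{k}\sum_{i=1}^{k}\nabla F(y_i)\Big\|\le\epsilon
\qquad\text{and}\qquad
\max_{1\le i\le k}\|y_i-\bar{x}\|\le\delta,
$$

with the first condition coming from Lemma 7 and the second from Lemma 10, as described above.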
-
(Weakness 2) As discussed in the introduction section, Adam with EMA is very successful for large, complex neural network models. Our main goal was to understand why such a method is so effective. We consider the clipping a minor modification, and given that, our main message is a justification of the success of Adam with EMA. Hence, we do not think running further experiments is necessary, because proposing a new algorithm is not the main goal of this work.
-
(Question 2) Actually, the inequality likely goes the other way, so that clipping is a no-op. In practice, we expect the sequences to be random (due to stochastic noise, unstable training trajectories, etc.), for which standard concentration inequalities yield . Hence, in practical settings, we expect the clipping to be unnecessary.
-
(Question 3) The Lipschitzness condition is indeed a standard assumption in the nonconvex nonsmooth setting. However, we note that one of the main messages of our results is that Adam lets us adapt to the Lipschitzness constants (coordinate-wise) without knowledge of them. Note that we would almost certainly require some kind of structural assumption on the "difficulty" of the loss function, so if we were to dispense with Lipschitzness, it would likely be at the cost of some other assumption (such as smoothness).
Thanks for the rebuttal; I'll keep my score. Regarding the direction of the inequality, I believe my direction is correct (with an exact inequality, not an inequality up to constants) if all the 's are non-negative (this can be seen by squaring both sides).
You're right, of course, that if all the 's have the same sign, then the clipping will occur. Our observation is that this is extremely unlikely to happen. The reason is the following: imagine that the algorithm is converging (even if it is converging somewhat slowly). In such a case, we would expect the gradients to be roughly mean-zero random variables (e.g. https://arxiv.org/pdf/2111.06328 or https://arxiv.org/pdf/1508.00882), in which case we should not expect the clipping to occur.
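To illustrate this scale heuristic, here is a toy simulation with i.i.d. mean-zero scalars standing in for the gradient sequence (it does not reproduce the paper's exact clipping threshold or discounting; it only shows how the two relevant quantities scale):

```python
import numpy as np

rng = np.random.default_rng(0)
T, trials = 1_000, 10_000
g = rng.normal(size=(trials, T))            # i.i.d. mean-zero "gradients"

sum_abs = np.abs(g.sum(axis=1))             # |sum_t g_t|
root_sq = np.sqrt((g ** 2).sum(axis=1))     # sqrt(sum_t g_t^2)
print("mean-zero case:", np.median(sum_abs), "vs", np.median(root_sq))
# both are O(sqrt(T)): roughly 0.67*sqrt(T) vs sqrt(T)

h = np.abs(g)                               # a same-sign sequence of the same magnitudes
print("same-sign case:", np.median(h.sum(axis=1)), "vs", np.median(np.sqrt((h ** 2).sum(axis=1))))
# the sum is now Theta(T), about sqrt(T) times larger than sqrt(sum_t g_t^2)
```

The mean-zero sum grows like $\sqrt{T}$, the same order as $\sqrt{\sum_t g_t^2}$, whereas a same-sign sequence makes the sum larger by roughly a $\sqrt{T}$ factor, which is the regime in which clipping would actually activate.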
In fact, the case in which clipping does occur frequently leads to "too good to be true" conclusions, which should provide further intuitive evidence that it is unlikely. For example, https://jmlr.org/papers/v18/17-079.html shows that if the prediction of FTRL is clipped on every round, then we would in fact suffer only logarithmic regret rather than the standard regret bound, and http://proceedings.mlr.press/v97/cutkosky19b.html shows that in the case that (even if clipping doesn't occur every single round), we can obtain preconditioning "for free" without any extra computational cost.
This paper examines the benefit of a model exponential moving average when the model is trained with Adam. In particular, the paper uses a discounted-to-nonconvex conversion framework with a scale-free FTRL learner to obtain nonconvex optimization guarantees. The resulting algorithm resembles a clipped variant of Adam with model EMA, thus providing optimization guarantees for it. While there are differences from the version of Adam used in practice, the results are nevertheless insightful. The reviews for the paper were mostly positive, except for one low-confidence reviewer providing a lower score. While some of their concerns were valid, I believe the rebuttal mostly addresses them. I think this paper will be a valuable addition to the conference. I recommend acceptance.