Independently-Normalized SGD for Generalized-Smooth Nonconvex Optimization
For the deterministic case, we study normalized GD under generalized smoothness and a generalized PL condition; for the stochastic case, we propose Independently-Normalized SGD and conduct a convergence analysis under a relaxed noise assumption.
Abstract
Reviews and Discussion
In this work, the authors study the convergence of Normalized Gradient Descent and Normalized Stochastic Gradient Descent under a generalized smoothness assumption and a PL condition.
Strengths
This is a theoretical work on optimization. Assumptions are presented, theorem statements are written. The Appendix contains the proofs of the theorems.
Weaknesses
I think that the contribution of this paper is not good enough for the standards of the ICLR conference.
NGD was already proposed and analyzed in the nonconvex setting in [1]. Extending their analysis to the PL case does not seem like a strong contribution to me.
The second question in the paper is: "Can we develop a novel algorithm tailored for stochastic generalized-smooth optimization that achieves fast convergence in practice while providing convergence guarantee under relaxed assumptions?"
There are papers with the analysis of NSGD with momentum [2], Adam [3,4], AdaGrad-Norm/AdaGrad [5,6]. All are tailored for the stochastic optimization in the generalized smooth setting and seem more advanced. All have convergence guarantees. Is the newly proposed I-NSGD faster in practice than any of them? As far as I can see, I-NSGD is compared to SGD, Clipped-SGD, NSGD in the experimental section only. The original paper [7], which introduced the generalized smoothness assumption, explains that this assumption is based on empirical findings from language model training and image classification tasks. Does I-NSGD perform well on them?
There is a recent paper [8], which contains a method similar to NSGD with smoothed clipping (Algorithm 1: $(L_0,L_1)$-GD). You mention this work in your paper. I can see that the authors develop adaptive versions of the algorithm as well as an accelerated version. How does your theory compare to theirs? As far as I can tell from their contribution section, they improve upon many previous results.
[1] Ziyi Chen, Yi Zhou, Yingbin Liang, and Zhaosong Lu, Generalized-Smooth Nonconvex Optimization is As Efficient As Smooth Nonconvex Optimization.
[2] Hübler, F., Yang, J., Li, X., and He, N., Parameter-Agnostic Optimization under Relaxed Smoothness.
[3] Wang, B., Zhang, Y., Zhang, H., Meng, Q., Ma, Z.-M., Liu, T.-Y., and Chen, W., Provable adaptivity in adam.
[4] Li, H., Rakhlin, A., and Jadbabaie, A., Convergence of adam under relaxed assumptions.
[5] Faw, M., Rout, L., Caramanis, C., and Shakkottai, S., Beyond uniform smoothness: A stopped analysis of adaptive sgd.
[6] Wang, B., Zhang, H., Ma, Z., and Chen, W., Convergence of adagrad for non-convex objectives: Simple proofs and relaxed assumptions.
[7] Jingzhao Zhang, Tianxing He, Suvrit Sra, Ali Jadbabaie, Why gradient clipping accelerates training: A theoretical justification for adaptivity
[8] Eduard Gorbunov, Nazarii Tupitsa, Sayantan Choudhury, Alen Aliev, Peter Richtárik, Samuel Horváth, Martin Takáč, Methods for Convex (L0,L1)-Smooth Optimization: Clipping, Acceleration, and Adaptivity
Questions
There is a standard sanity check procedure in optimization. The field is connected to physics (take the heavy-ball method of Polyak, for instance). And different quantities in physics have different units of measure. Take, e.g., the mechanical work: We get that 1 Joule is 1 Newton multiplied by 1 meter.
I will try to clarify my next concern with the example from physics.
Let us consider the one-dimensional problem of minimizing the coordinate $x(t)$ of a material point, where $t$ is the time variable. Suppose $f(t) = x(t)$ is some $(L_0,L_1)$-smooth, twice-differentiable function. In the work [7] (see above), $(L_0,L_1)$-smoothness reduces in this case to the condition $|f''(t)| \leq L_0 + L_1 |f'(t)|$. Let us observe that $|f''(t)|$ is the absolute value of the acceleration and is measured in m/s^2. Therefore, $L_0$ has the same unit of measure, m/s^2. In turn, $|f'(t)|$ is the absolute value of the speed and is measured in m/s. Therefore, $L_1$ has the unit of measure (m/s^2)/(m/s) = 1/s.
In Theorem 1, when choosing $\gamma$, the constants $L_0$ and $L_1$ are added to each other. How can two quantities with different units of measure be added to each other? This happens throughout the whole paper. In Theorem 2, when choosing $\gamma$, you compare $L_0$ and $L_1$ inside a minimum; when you define $\Lambda$, you add them again.
In the original work [7] the stepsize is chosen as either $\frac{1}{L_0}$ or $\frac{1}{L_1 M}$ (up to numerical constants), where $M$ is an upper bound on the gradient norm, so both choices have the same unit of measure. The same holds in the work [8].
The constants $L_0$ and $L_1$ are not added to each other as in your paper, so those step sizes do not have this problem with the units of measure. Therefore, there might be technical flaws in your proofs.
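To make the unit bookkeeping explicit, the argument above can be summarized in one display (under the setup $f(t)=x(t)$ with $x$ in meters and $t$ in seconds):
$$
|f''(t)| \le L_0 + L_1 |f'(t)|, \qquad
[f''] = \mathrm{m/s^2},\;\; [f'] = \mathrm{m/s}
\;\;\Longrightarrow\;\;
[L_0] = \mathrm{m/s^2},\;\; [L_1] = \mathrm{1/s},
$$
so the sum $L_0 + L_1$ adds a quantity measured in $\mathrm{m/s^2}$ to one measured in $\mathrm{1/s}$.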
Thank you very much for reviewing our manuscript and providing valuable feedback. Below is a response to the review questions/comments. We have revised the manuscript accordingly and uploaded a revised manuscript. Please let us know if further clarifications are needed.
Q1: There are papers with the analysis of NSGD with momentum [2], Adam [3,4], AdaGrad-Norm/AdaGrad [5,6]. All are tailored for the stochastic optimization in the generalized smooth setting and seem more advanced. All have convergence guarantees. Is the newly proposed I-NSGD faster in practice than any of them?
A: Thanks for your question. We agree that both our work and these works achieve the optimal rate under similar generalized-smoothness conditions. While these works focus on analyzing existing adaptive-type algorithms, our work aims to propose a new and fundamental independent sampling scheme that can inspire new stochastic algorithm designs and is hence complementary to the existing studies on algorithms. We have conducted more experiments showing that our proposed I-NSGD is indeed much faster than SPIDER and NSGD with momentum.
For the Phase Retrieval and DRO experiments, the training loss versus sample complexity and a detailed description of the algorithm settings can be found in Appendix I.
For Phase Retrieval, the loss function evaluated every 100 epochs is:
| Method | 100 | 200 | 300 | 400 | 500 |
|---|---|---|---|---|---|
| SGD | 4837.102334 | 1835.358158 | 1240.747515 | 991.178520 | 882.728049 |
| NSGD with momentum | 3419.421198 | 237.020564 | 12.536876 | 11.795320 | 11.203911 |
| NSGD | 6345.978257 | 579.544323 | 10.754439 | 10.820819 | 10.809943 |
| SPIDER | 165.234828 | 253.344905 | 222.983539 | 199.988661 | 196.264103 |
| Clipped SGD | 942.788701 | 630.029406 | 378.531543 | 187.528112 | 75.678732 |
| I-NSGD | 150.054129 | 8.492338 | 8.257874 | 8.305479 | 8.305773 |
For DRO, the detailed description of the algorithm settings and the training loss versus sample complexity can be found in Appendix I of our paper.
The objective function evaluated every 200 epochs is:
| Method | 200 | 400 | 600 | 800 | 1000 |
|---|---|---|---|---|---|
| SGD | 13.980096 | 11.792903 | 10.343812 | 9.370857 | 8.550016 |
| NSGD with momentum | 10.655368 | 7.589949 | 6.060729 | 5.322460 | 4.770303 |
| NSGD | 11.015746 | 8.015699 | 6.468458 | 5.478018 | 4.947710 |
| SPIDER | 6.059500 | 3.829211 | 3.559534 | 3.520244 | 3.482132 |
| Clipped SGD | 8.699038 | 6.449909 | 5.615763 | 5.003331 | 4.675120 |
| I-NSGD | 4.351718 | 3.859514 | 3.717275 | 3.636329 | 3.609229 |
The experimental results on Phase Retrieval and DRO show that I-NSGD is faster than NSGD with momentum, SPIDER, and Clipped SGD.
Q2: The original paper [7], which introduced the generalized smoothness assumption, explains that this assumption is based on empirical findings from language model training and image classification tasks. Does I-NSGD perform well on them?
A: Thank you for your question. The generalized smoothness condition we use is the one proposed in [1], which generalizes the $(L_0,L_1)$-smoothness condition proposed in [7] by introducing a problem-dependent constant $\alpha$. This modification allows the function geometry to be characterized more tightly. Nevertheless, per your request, we conducted experiments training ResNet on CIFAR-10 with SGD, Adam, Adagrad, NSGD with momentum, NSGD, Clipped SGD, and I-NSGD. Results are available in Appendix I of the submitted paper.
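For reference, a sketch of the local descent inequality implied by the asymmetric generalized-smoothness condition of [1] mentioned above (the exact constants and the admissible range of $||w'-w||$ are as stated in the paper) is
$$
f(w') \le f(w) + \nabla f(w)^{\top}(w'-w) + \frac{1}{2}\big(L_0 + L_1 ||\nabla f(w)||^{\alpha}\big)\,||w'-w||^2,
$$
which reduces to the standard $L_0$-smooth descent lemma when $L_1 = 0$ and recovers the $(L_0,L_1)$-smoothness setting of [7] when $\alpha = 1$.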
Q3: There is a recent paper [8], which contains a method similar to NSGD with smoothed clipping (Algorithm 1: $(L_0,L_1)$-GD). You mention this work in your paper. I can see that the authors develop adaptive versions of the algorithm as well as an accelerated version. How does your theory compare to theirs?
A: Thank you for pointing out this reference. We will add more discussion in the related work accordingly. We note that the proposed $(L_0,L_1)$-GD and its accelerated version are deterministic algorithms, and their analysis is conducted in the convex setting. Our proposed I-NSGD algorithm's sample complexity is obtained in the nonconvex stochastic setting. The stochastic extension $(L_0,L_1)$-SGD in [8] does not include our independent sampling scheme or clipping structure, and its convergence analysis is also for the convex setting. We appreciate their work very much, but this concurrent work [8] has a totally different setting from ours.
Q4c: In Theorem 2, when choosing $\gamma$, you compare $L_0$ and $L_1$ inside a minimum, and then you add them again.
A: Thank you for pointing this out. This is indeed a typo and we are sorry for the confusion. To clarify, the learning rate in Theorem 2 should be the corrected expression given in the revised manuscript.
As for the term in question, it comes from Lemma 6 in Appendix F, where we claim the corresponding upper bound.
When we prove this fact, we split the argument into cases based on the value of the gradient norm. When it is below a small constant, we upper bound the LHS by a constant; when it is large, we upper bound the LHS with an associated constant factor. Finally, we sum the two upper bounds to yield the RHS, since both are always positive.
To elaborate, in the first case we derive the upper bound by replacing the gradient norm with its bounding constant and multiplying the corresponding factor in front. Moreover, by using the clipping structure, we obtain the stated bound. The second case, where the gradient norm is large, is handled analogously, and we omit the discussion here. For more details, please see Appendix F, Proof of Lemma 6.
References:
[1] Chen, Ziyi, et al. Generalized-Smooth Nonconvex Optimization is As Efficient As Smooth Nonconvex Optimization.
[2] Hübler, Florian, et al. Parameter-Agnostic Optimization under Relaxed Smoothness.
[3] Wang, Bohan, et al. Provable adaptivity in adam.
[4] Li, H., Rakhlin, A., and Jadbabaie, A., Convergence of adam under relaxed assumptions.
[5] Faw, Matthew, et al. Beyond uniform smoothness: A stopped analysis of adaptive SGD.
[6] Wang, Bohan, et al. Convergence of adagrad for non-convex objectives: Simple proofs and relaxed assumptions.
[7] Jingzhao Zhang, Tianxing He, Suvrit Sra, Ali Jadbabaie, Why gradient clipping accelerates training: A theoretical justification for adaptivity
[8] Gorbunov, Eduard, et al. Methods for Convex (L0,L1)-Smooth Optimization: Clipping, Acceleration, and Adaptivity.
[9] Arjevani, Yossi, et al. Lower bounds for non-convex stochastic optimization.
[10] Fang, Cong, et al. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in Neural Information Processing Systems 31 (2018).
[11] Cutkosky, Ashok, et al. "Momentum improves normalized sgd." International conference on machine learning. PMLR, 2020.
Q4a: There is a standard sanity check procedure in optimization. The field is connected to physics (take the heavy-ball method of Polyak, for instance), and different quantities in physics have different units of measure. In Theorem 1, when choosing $\gamma$, the constants $L_0$ and $L_1$ are added to each other. How can two quantities with different units of measure be added to each other?
A: Great question! We agree that some optimization algorithms (such as the heavy-ball method) have physical interpretations, and we understand that some areas related to optimization, such as adaptive control, do assign physical units to parameters so that the system output has the desired performance and reasonable interpretability. However, in optimization we do not agree with the sanity check you propose, i.e., assigning physical units to check the technical correctness of the learning rate. In the generalized smoothness condition, $L_0$ and $L_1$ are simply scalars that characterize the function's geometry, which is different from the physical properties of a real system.
On the other hand, if $L_0$ and $L_1$ carried the physical units you describe, the resulting descent lemma would itself conflict with those units, since all of its terms would need to be measured in the same units as $f$. As long as the learning rate can control the update so that the sequences generated by the algorithm, such as the iterates and the function values, converge, the learning rate is suitable for the proof. In summary, the role of the learning rate is to numerically prevent the sequence from diverging as the algorithm progresses.
Q4b: In Theorem 1, when choosing $\gamma$, the constants $L_0$ and $L_1$ are added to each other. How can two quantities with different units of measure be added to each other?
A: Thanks for raising the question about the step-size rule. We address it by listing the key steps leading to the step-size design. The main derivation can be found at inequality (23) in the appendix: starting from the descent lemma, plugging in the NGD update rule, and applying Lemma E.2 derived in [1], we have
$$
f(w_{t+1})-f(w_t) \leq -\gamma ||\nabla f(w_t)||^{2-\beta}+\frac{\gamma}{4}\Big(2 ||\nabla f(w_t)||^{2-\beta}+(2 L_0 \gamma)^{\frac{2}{\beta}-1}+(2 L_1 \gamma)^{\frac{2}{\beta}-1}\Big),
$$
where the constants $(2L_0\gamma)^{\frac{2}{\beta}-1}$ and $(2 L_1 \gamma)^{\frac{2}{\beta}-1}$ on the RHS are obtained by utilizing **Lemma E.2** from [1]. This lemma enables us to separate the terms involving $||\nabla f(w_t)||^{2-\beta}$ from the remaining constants. Then, since $a^{\tau}+b^{\tau}\leq (a+b)^{\tau}$ holds for any $a,b\geq 0$ and $\tau\geq 1$, we can combine $(2L_0\gamma)^{\frac{2}{\beta}-1}+(2 L_1 \gamma)^{\frac{2}{\beta}-1}\leq \big(2\gamma(L_0+L_1)\big)^{\frac{2}{\beta}-1}=\gamma^{\frac{2}{\beta}-1}(2L_0+2L_1)^{\frac{2}{\beta}-1}$, which yields
$$
f(w_{t+1})-f(w_t)\leq -\frac{\gamma}{2}||\nabla f(w_t)||^{2-\beta}+\gamma^{\frac{2}{\beta}}(2 L_0+2 L_1)^{\frac{2}{\beta}-1}.
$$
By designing $\gamma = \frac{(2\mu\epsilon)^{\beta/\rho}}{8(L_0+L_1)+1}$, we can control the constant on the RHS to be small and easy to handle in the subsequent derivations.

We'd also like to update our experimental results on training ResNet on the CIFAR-10 dataset. The baseline methods include SGD, Adagrad, Adam, NSGD, NSGD with momentum, and Clipped SGD. The loss function is the cross-entropy loss, and the weights use Kaiming initialization. The detailed algorithm settings can be found in Appendix I of our paper. We train all baseline methods and I-NSGD on CIFAR-10 for 30 epochs each. The resulting training loss recorded every 5 epochs is as follows.
ResNet18 on CIFAR10
| Method | 5 | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|
| SGD | 3.799652861431241 | 2.6444492801092565 | 1.8017334272153676 | 1.160722715780139 | 0.744824364897795 | 0.4474578978260979 |
| Adam | 1.4727556260768324 | 0.6025010953308083 | 0.3352339585107984 | 0.21858624278684147 | 0.16716759049450047 | 0.14670390529863653 |
| Adagrad | 1.5795468527358025 | 0.47506230941507965 | 0.17185649830207694 | 0.09319731855066493 | 0.061325946466240566 | 0.04338877404734376 |
| NSGD | 1.9135879594832659 | 0.5788353047100827 | 0.292824345495319 | 0.21826101756596472 | 0.17621186005271738 | 0.15782649893662892 |
| NSGDm | 4.045813534408808 | 2.9456085418350995 | 2.174826104193926 | 1.5469997769687325 | 1.0462557922583073 | 0.6905137724243104 |
| ClipSGD | 1.737209850223735 | 0.6230987445160281 | 0.3534815391758457 | 0.2730324479562114 | 0.2233854775040527 | 0.19943709113067598 |
| INSGD | 1.6496218708343804 | 0.6025417874916457 | 0.4097965414985083 | 0.3055273554782616 | 0.26761460003763204 | 0.24892822335095843 |
From these records, we can see that I-NSGD converges fast, comparably to Adam and Clipped SGD, and faster than NSGDm, NSGD, and SGD. This demonstrates the effectiveness of the I-NSGD algorithm.
ResNet50 on CIFAR10
| Method | 5 | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|
| SGD | 5.990726664662361 | 4.939687485806644 | 4.287755103781819 | 3.7324438001960516 | 3.25510926079005 | 2.729151414707303 |
| Adam | 3.6487857876345515 | 2.9452997343614697 | 2.0379750484135 | 1.0010075609316118 | 0.4167641017556889 | 0.7922854552452918 |
| Adagrad | 4.240041005425155 | 2.9372318675741553 | 1.9360805025789887 | 1.320955990231596 | 0.9049579487764277 | 0.634457179432502 |
| NSGD | 6.778490765951574 | 4.921374089084566 | 3.6041279016062617 | 2.2895418484695256 | 1.3302382924593985 | 0.8264894502353854 |
| NSGDm | 7.282688681036234 | 6.463156248442829 | 5.85003749653697 | 5.367515945807099 | 5.002681262791157 | 4.6777457147836685 |
| ClipSGD | 5.527418973855674 | 3.230063008144498 | 1.6175740614999086 | 0.8517110171960667 | 0.5524751330958679 | 0.4216658901714254 |
| INSGD | 3.98776736529544 | 2.0011105102021247 | 0.9845839753979817 | 0.5838269208616111 | 0.4778438211797038 | 0.3469473898294382 |
From the above records, we can see that I-NSGD indeed converges fast, comparably to Adam and Adagrad, and faster than ClipSGD, NSGDm, NSGD, and SGD. This again demonstrates the effectiveness of the I-NSGD algorithm.
Thank you for your response.
I repeat that this sanity check procedure is standard in the optimization field. I do not see why the descent lemma contradicts the procedure. The descent lemma does not depend on your algorithm; it is based on calculus. Physical laws do not depend on your algorithm either. By saying that the descent lemma contradicts the physical units, you essentially mean that physical laws do not merge with calculus. But calculus is the language of physics; all physical laws are expressed through it. Both theories are built so that they merge.
I think I have explained myself very clearly in the previous review and made no mistake. It is a simple procedure, and I will repeat it. I will use the same setup as before, with $f(t)=x(t)$ the coordinate of the material point (in meters) and $t$ the time (in seconds), and plug it into the descent lemma.
Let us go term by term. $f(t_k)$ and $f(t_{k+1})$ are measured in meters, since they are coordinates of the material point at times $t_k$ and $t_{k+1}$. Next, $f'(t_k)$ is m/s (the speed), and $t_{k+1}-t_k$ is s (seconds, time). Then, $f(t_k)+f'(t_k)(t_{k+1}-t_k)$ gives meters. No contradiction so far! Let us go further: $L_0$ has to be m/s^2, as $\frac{L_0}{2}(t_{k+1}-t_k)^2$ has to be measured in meters. Next, $L_1$ needs to be 1/s, since $\frac{L_1 |f'(t_k)|}{2}(t_{k+1}-t_k)^2$ has to be in meters.
Once again, $L_0$ and $L_1$ have different units of measure. These are not just simple constants.
The primary area of your paper is chosen as optimization. Imagine I want to optimize a simple function from physics with your algorithms (why not, it is optimization, it should work on functions from physics). I look at your theoretical guarantees, try to use them, and suddenly I fail because I cannot add $L_0$ to $L_1$.
The problem is in the further analysis, I believe.
The paper is theoretical. However, the theory fails on a simple example. I will keep my score.
Thank you for raising your concerns regarding the compatibility of units in our analysis. We greatly appreciate the effort you put into articulating this potential issue and recognize the importance of ensuring consistency in physical measurements. While we respectfully disagree with your proposal to identify the optimization variable with the physical time $t$, we value your perspective and are eager to engage in further discussion and review the key steps that led to these terms. To foster a constructive dialogue, we have highlighted the critical steps that result in what may appear to be a unit-inconsistent step size. Your view is important, and we hope our work can improve its applicability to physics.
**Discussion on the choice of $\gamma$.** Similarly to [1], by applying the NGD update rule and Young's inequality, we obtain a descent inequality with a constant term that needs to be controlled. To simplify this term, we design $\gamma = \frac{(2\mu\epsilon)^{\beta/\rho}}{8(L_0+L_1)+1}$. Substituting this choice and extracting the common factors, the constant term becomes
$$
\frac{1}{4}\,\frac{(2\mu\epsilon)^{\frac{\beta}{\rho}}\,(2\mu\epsilon)^{\frac{2-\beta}{\rho}}}{8(L_0+L_1)+1},
$$
which is equivalent to
$$
\frac{\gamma}{4}\,(2\mu\epsilon)^{\frac{2-\beta}{\rho}}.
$$
**Discussion on $\Lambda$.** TL;DR: $\Lambda$ is derived for the case where the full gradient is smaller than a certain constant, $||\nabla F(w_t) ||\leq \sqrt{1+\frac{\tau_2^2}{\tau_1^2}}$. In this case, it is natural to upper bound $L_1 ||\nabla F(w_t) ||^{\alpha}$ by the constant $L_1 (1+\frac{\tau_2^2}{\tau_1^2})^{\frac{\alpha}{2}}\leq L_1 (1+\frac{\tau_2^2}{\tau_1^2})^{\frac{1}{2}}$. The rest of the derivation aggregates $L_0$ into $L_0 (1+\frac{\tau_2^2}{\tau_1^2})^{\frac{1}{2}}$ and adds it together with $L_1 (1+\frac{\tau_2^2}{\tau_1^2})^{\frac{1}{2}}$, yielding
$$
\frac{1}{2h_t^{2\beta}}\gamma^2(L_0+L_1)\Big(1+\frac{\tau_2^2}{\tau_1^2}\Big)\frac{\tau_2^2}{\tau_1^2}\leq \frac{1}{2h_t^{2\beta}}\gamma^2(L_0+L_1)\Big(1+\frac{\tau_2^2}{\tau_1^2}\Big)\Big(1+\frac{\tau_2^2}{\tau_1^2}\Big).
$$
Since $h_t = \max\Big(1, (4L_1\gamma)^{\frac{1}{\beta}}\big(2||\nabla f_{\xi_{B'}}(w_t)||+ \delta\big) \Big)\geq 1$, we have $\frac{1}{h_t^{2\beta}}\leq 1$, which further yields the exact expression of $\Lambda$. The above derivation is grounded in established mathematical rules and inequalities. We understand that you have concerns about potential technical flaws. Could you kindly specify which steps you believe may contain such flaws and which ones you find counterintuitive from a physical perspective? Your insights would be invaluable in helping us address these concerns effectively.

[1] Chen, Ziyi, et al. Generalized-Smooth Nonconvex Optimization is As Efficient As Smooth Nonconvex Optimization.

The paper analyzes stochastic normalized SGD with an extra sampling step under a generalized smoothness assumption and obtains a known complexity rate. For the deterministic case, the paper additionally analyzes normalized GD under a generalized Polyak-Łojasiewicz condition.
Strengths
- the convergence of normalized gradient descent under the generalized Polyak-Łojasiewicz (PŁ) condition
- generalized variance assumption
Weaknesses
- only the symmetric case is considered, which is much easier to analyze; [Hübler et al., 2024] (mentioned by the authors) obtains the same rate under a non-symmetric, more general smoothness setup
- the proposed I-NSGD requires additional sampling while not improving the known complexity rate; [Hübler et al., 2024] does not need additional sampling and has the same rate
- given the above, [Hübler et al., 2024] is not compared against experimentally
Questions
- I would like to hear the authors' comments on the result under the Polyak-Łojasiewicz condition. Although I am not aware of such a result, and it is therefore clearly a contribution, its value seems limited. I would like the authors to explain in detail why this result is important. I would also like an explanation of why NSGD is considered under this setup and not, for example, SGD. Does NSGD remarkably benefit from the Polyak-Łojasiewicz setup?
- But my main concern is the contribution of the proposed I-NSGD. The most important factor is the paper's focus on only symmetric generalized smoothness. Non-symmetric smoothness holds for a problem formulation appearing in Distributionally Robust Optimization (Jin et al., 2021). Chen et al. (2023) also show that the exponential function satisfies the non-symmetric case. Moreover, the rate is the same despite the less general assumption. So I want the authors to comment on why they think the case of less general smoothness outweighs the more general variance assumption. Can the authors argue that the more general variance assumption is not compatible with an analysis of the kind in [Hübler et al., 2024]? Why haven't the authors included [Hübler et al., 2024]'s NSGD with momentum in the experimental comparison? It is a highly related method to the best of my understanding. By now, neither experiments nor theory tells me that I-NSGD is preferable in any scenario.
References:
Jikai Jin, Bohang Zhang, Haiyang Wang, and Liwei Wang. Non-convex distributionally robust optimization: Non-asymptotic analysis. Advances in Neural Information Processing Systems, 34: 2771–2782, 2021.
Ziyi Chen, Yi Zhou, Yingbin Liang, and Zhaosong Lu. Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization. In International Conference on Machine Learning, pp. 5396–5427. PMLR, 2023.
Q3: Can the authors argue that the more general variance assumption is not compatible with the analysis conducted in [3]?
A: Great question! We have added some discussion in our related work accordingly. [3] studies normalized gradient descent with momentum under $(L_0,L_1)$-generalized smoothness with bounded variance, and the learning rate they adopt is in the parameter-free regime. We found that their Lemma 13 in the appendix serves as a key inequality under their variance assumption, where they upper bound the accumulated gradient-estimation error as follows:
$$
\mathbb{E}\Big[\Big\|\sum_{\tau=1}^t \beta_{(\tau+1): t} \alpha_\tau \big(g_{\tau}-\nabla F(w_{\tau})\big)\Big\|\Big] \leq \sqrt{\sum_{\tau=1}^t \beta_{(\tau+1): t}^2 \alpha_t^2 \sigma^2},
$$
where $\sigma^2$ represents their variance. If we consider the more general variance condition $||g_{\tau}-\nabla F(w_{\tau}) ||\leq \tau_1 ||\nabla F(w_{\tau}) ||+\tau_2$ used in our work, the above inequality induces additional $||\nabla F(w_{\tau})||^2$ terms in order to bound $g_{\tau}-\nabla F(w_{\tau})$, which is no longer compatible with their original analysis. Moreover, since our I-NSGD does not include the momentum $m_t$ or a parameter-agnostic step-size rule, their Lemma 13 and convergence analysis do not apply to our setting. So our analysis is not compatible with [3].

**Q4:** Why haven't the authors included [Hübler et al., 2024]'s NSGD with momentum in the experimental comparison? Can you add experiments with NSGD with momentum for comparison?

**A:** Thanks for your suggestion. In the new experiments, we add NSGD with momentum and SPIDER for comparison. The training loss versus sample complexity and a detailed description of the algorithm settings can be found in Appendix I of our paper. For the Phase Retrieval experiment, the loss function evaluated every 100 epochs is:

| Method | 100 | 200 | 300 | 400 | 500 |
|---|---|---|---|---|---|
| SGD | 4837.102334 | 1835.358158 | 1240.747515 | 991.178520 | 882.728049 |
| NSGD with momentum | 3419.421198 | 237.020564 | 12.536876 | 11.795320 | 11.203911 |
| NSGD | 6345.978257 | 579.544323 | 10.754439 | 10.820819 | 10.809943 |
| SPIDER | 165.234828 | 253.344905 | 222.983539 | 199.988661 | 196.264103 |
| Clipped SGD | 942.788701 | 630.029406 | 378.531543 | 187.528112 | 75.678732 |
| I-NSGD | **150.054129** | **8.492338** | **8.257874** | **8.305479** | **8.305773** |

For DRO, the detailed description of the algorithm settings and the training loss versus sample complexity can be found in Appendix I. The objective function evaluated every 200 epochs is:

| Method | 200 | 400 | 600 | 800 | 1000 |
|---|---|---|---|---|---|
| SGD | 13.980096 | 11.792903 | 10.343812 | 9.370857 | 8.550016 |
| NSGD with momentum | 10.655368 | 7.589949 | 6.060729 | 5.322460 | 4.770303 |
| NSGD | 11.015746 | 8.015699 | 6.468458 | 5.478018 | 4.947710 |
| SPIDER | 6.059500 | 3.829211 | 3.559534 | 3.520244 | 3.482132 |
| Clipped SGD | 8.699038 | 6.449909 | 5.615763 | 5.003331 | 4.675120 |
| I-NSGD | **4.351718** | **3.859514** | **3.717275** | **3.636329** | **3.609229** |

From the recorded training losses for Phase Retrieval and DRO, we see that I-NSGD is significantly faster than the other baseline methods: Clipped SGD, SPIDER, NSGD, and NSGD with momentum. This verifies the effectiveness of the proposed I-NSGD algorithm.

References:
[1] Jin, Jikai, et al. Non-convex distributionally robust optimization: Non-asymptotic analysis. Advances in Neural Information Processing Systems, 34: 2771–2782, 2021.
[2] Chen, Ziyi, et al. Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization. In International Conference on Machine Learning, pp. 5396–5427. PMLR, 2023.
[3] Hübler, Florian, et al. Parameter-agnostic optimization under relaxed smoothness. In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, 2024.
[4] Liu, Chaoyue, et al. "Loss landscapes and optimization in over-parameterized non-linear systems and neural networks." Applied and Computational Harmonic Analysis 59 (2022): 85-116.
[5] Scaman, Kevin, et al. "Convergence rates of non-convex stochastic gradient descent under a generic Łojasiewicz condition and local smoothness." International Conference on Machine Learning. PMLR, 2022.
[6] Li, Haochuan, et al. "Convex and non-convex optimization under generalized smoothness." Advances in Neural Information Processing Systems 36 (2024).

Thank you for your response.
- I would like to thank the authors for explaining the value of PL. But I wanted to say that PL simplifies the analysis and theory and limits the contribution. That explains my score; it just does not seem good enough to me. It would be great to see an interesting, unique problem satisfying your setup, or an advance in theory allowing a much better bound, for example without the additional exponent (see below).
- According to
Chen, Ziyi, et al. "Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization." International Conference on Machine Learning. PMLR, 2023,
the inequality in the authors' answer to Q1b has an additional exponent in the symmetric case. As far as I understand, the authors picked the less general assumption because they can just hide this additional exponent. Results of
Vankov, Daniil, et al. "Optimizing $(L_0, L_1)$-Smooth Functions by Gradient Methods." arXiv preprint arXiv:2410.10800 (2024)
give promise that for GD the right rates should recover the standard one. The work is very recent, so I cannot appeal to it and it is just a comment, but it should explain why I think the symmetric case is much trickier.
Thank you very much for reviewing our manuscript and providing valuable feedback. Below is a response to the review questions/comments. We have revised the manuscript accordingly and uploaded a revised manuscript. Please let us know if further clarifications are needed.
Q1a: Why is the (generalized) Polyak-Łojasiewicz condition important? Its value seems limited.
A: Great question! We have updated our related work to discuss the motivation for studying the generalized PL condition in more detail. Initially, the PL condition attracted researchers' attention because it relaxes the strong convexity requirement. Recently, from the deep learning perspective, [5] found that PL* -- a variant of the PL condition -- widely exists in the loss landscape of over-parameterized neural networks; they also explain why SGD-type algorithms converge fast when training over-parameterized neural networks with the mean-squared loss. Later, [6] further proposed the more general KL* condition to include more loss functions, such as the CE loss and the hinge loss, and also provide convergence guarantees for SGD. Interestingly, it turns out that the generalized PL condition is a sub-class of the KL* condition when the corresponding exponent is non-negative, because in that case it can be rewritten as a lower bound on the squared gradient norm by a function of the suboptimality gap that is non-decreasing on $[0,\infty)$, as sketched below. Based on these observations, we are motivated to analyze the convergence of stochastic algorithms under the generalized PL condition.
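For concreteness, here is a sketch of this rewriting, assuming the generalized PL condition takes the form $||\nabla f(w)||^{\rho} \geq 2\mu\,(f(w)-f^*)$ (the form consistent with the recursion quoted later in this discussion; the exact statement is as in the paper):
$$
||\nabla f(w)||^{\rho} \geq 2\mu\,\big(f(w)-f^*\big)
\;\;\Longleftrightarrow\;\;
||\nabla f(w)||^{2} \geq \varphi\big(f(w)-f^*\big),
\qquad \varphi(u) := (2\mu u)^{2/\rho},
$$
and $\varphi$ is non-decreasing on $[0,\infty)$ whenever $\rho>0$, which is exactly the KL*-type structure described above: a squared gradient norm lower bounded by a non-decreasing function of the suboptimality gap.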
Q1b: In the deterministic case, why is NGD considered under this setup and not GD? Does NGD remarkably benefit from the Polyak-Łojasiewicz setup?
A: Great question! In the deterministic setting, [2] showed that generalized-smooth functions have the following challenging local geometry:
$$
f(w') \leq f(w)+\nabla f(w)^{\top}(w'-w)+\frac{1}{2}\big(L_0+L_1||\nabla f(w)||^{\alpha}\big)||w'-w||^2,
$$
where the term induced by generalized smoothness scales polynomially with the gradient norm. This motivates [8] to apply the NGD update to control the error term induced by the generalized-smooth geometry.
In [2], the authors also conducted experiments showing that NGD outperforms GD in solving various nonconvex generalized-smooth machine learning problems. Moreover, the previous work [6] has studied GD under generalized smoothness; its convergence result requires strong assumptions on an initial gradient bound, which is impractical. This evidence indicates that NGD is more adaptive to the generalized-smooth geometry than GD, and motivates us to continue studying NGD under the generalized PL condition, which has attracted much attention in optimization and deep learning.
Q2: The most important factor is the paper's focus only on asymmetric generalized smoothness, while symmetric smoothness holds for a problem appearing in [2]. Why do the authors pick the less general smoothness condition? Why do they think the case of less general smoothness outweighs the more general variance assumption?
A: Thanks for your question! We study the asymmetric generalized smoothness just to simplify the calculations in the proof, and we note that our analysis can be directly generalized to the symmetric generalized-smooth case with minor modifications. Specifically, in the deterministic case, we can instead choose the NGD learning rate $\gamma = \frac{(2\mu\epsilon)^{\beta/\rho}}{12(K_0+K_1+2K_2)+1}$ when $\alpha\in(0,1)$, and an analogous choice when $\alpha=1$, where $K_0$, $K_1$, $K_2$ are the constants of the symmetric condition. The same conclusion and proof logic still hold. The stochastic setting can be similarly addressed by using a slightly smaller learning rate.
To provide an intuitive justification, note that the only difference between asymmetric and symmetric generalized smoothness is that the asymmetric definition depends on the gradient at $w$ only, whereas the symmetric definition depends on the gradients at both $w$ and $w'$. Such a difference is marginal, since the distance between $w$ and $w'$ is kept very small in the proof (by using a small learning rate). Thus, we use the asymmetric version merely for simplicity of the proof. On the other hand, we note that the machine learning applications (DRO, deep networks, etc.) mentioned in the paper satisfy asymmetric generalized smoothness; the proof for the DRO problem can be found in Proposition 3.2 of [2].
Thanks for your quick reply. We would like to add more discussion regarding your claims.
**Claim**: The PL condition simplifies the proof.

**Discussion**: In Chen et al., the authors have already studied NGD convergence in the nonconvex setting and obtained iteration complexities. Our study continues this line of work under the generalized PL condition. Unlike the original PL condition, which lets one directly replace the squared gradient norm by the suboptimality gap to obtain linear convergence, our generalized PL condition is more difficult to handle because of the exponents $\rho$ and $\beta$. These difficulties finally lead us to the recursion
$$
\Delta_{t+1} \leq \Delta_t - \frac{\gamma (2\mu)^{\frac{2-\beta}{\rho}}}{4} \Delta_t^{\frac{2-\beta}{\rho}},
$$
where $\Delta_t = f(w_t)-f^*$. We have to discuss the convergence behavior of this sequence for different values of $\rho$ and $\beta$ (for instance, when $\frac{2-\beta}{\rho}=1$ the recursion contracts geometrically and gives linear convergence), which correspond to the three different cases in our Theorem 1.

**Claim**: We use asymmetric generalized smoothness to hide mathematical tricks.

**Discussion**: We agree that the symmetric generalized-smoothness condition results in additional terms on the RHS, as we wrote in Q1b, and the descent lemma becomes
$$
f(w') \leq f(w)+\nabla f(w)^{\top}(w'-w)+\frac{1}{2}||w'-w||^2\left(K_0+K_1||\nabla f(w)||^\alpha+2 K_2||w'-w||^{\frac{\alpha}{1-\alpha}}\right)
$$
when $\alpha\in (0,1)$, and
$$
f(w') \leq f(w)+\nabla f(w)^{\top}(w'-w)+\frac{1}{2}||w'-w||^2\left(L_0+L_1||\nabla f(w)||\right) \exp\left(L_1 ||w'-w||\right)
$$
when $\alpha=1$. As we have explained in **Q2**, in this scenario, by choosing
$$
\gamma = \frac{(2\mu\epsilon)^{\beta/\rho}}{12(K_0+K_1+2K_2)+1} \quad \text{when } \alpha \in (0,1),
$$
and an analogous learning rate when $\alpha=1$, NGD leads to the same convergence conclusion. The proof logic is the same as the proof given in the appendix, except that it introduces another small component. To deal with the terms of different orders in $||w'-w||$, Chen et al. introduce Lemma E.1, where they utilize Young's inequality to unify the orders of all such terms. This technique is also used in our proof.
We would like to thank the reviewer for pointing out this excellent reference. We see that their convergence rates indeed reach the desired bounds in different cases. However, their conclusions are based on $(L_0,L_1)$-smoothness, which Chen et al. classify as a special case of generalized smoothness and which is also a sub-class of the symmetric generalized smoothness defined in their paper. Regarding your statement "Results from Vankov, Daniil, et al. give promise that for GD the right rates should recover the standard one," it is unclear to us, because those authors did not analyze algorithms under symmetric generalized smoothness, taking into account the exponent $\alpha$ and the extra constants it introduces. We are not sure whether their conclusions have direct implications for potential improvements of our results under symmetric generalized smoothness, as the designed learning rates are different and some of their convergence results depend on a radius constant that we do not assume. Could you explain this more explicitly? We are happy to respond to your concerns.
References
Vankov, Daniil, et al. "Optimizing $(L_0, L_1)$-Smooth Functions by Gradient Methods." arXiv preprint arXiv:2410.10800 (2024).
Chen, Ziyi, et al. "Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization." International Conference on Machine Learning. PMLR, 2023.
This paper considers the problem of nonconvex optimization under the $(L_0,L_1)$-smoothness condition (also known as generalized smoothness) and specifically considers two problems: (a) adapting to the generalized Polyak-Łojasiewicz (PŁ) condition, a generalization of strong convexity, when it holds, and (b) solving the stochastic problem, with or without the PŁ condition. For the former problem, the authors show that the normalized gradient descent method can achieve different (faster) rates depending on the PŁ condition, recovering linear convergence in the classical case (standard smoothness and the standard PŁ condition). For the stochastic problem, the authors consider a variant of normalized gradient descent where the normalization is done with a different, independently sampled point. Theorem 2 shows that this method can achieve the stated convergence rate (to achieve error $\epsilon$) without the need for large ($\epsilon$-dependent) batch sizes or assuming that the stochastic gradient noise norm is bounded almost surely. Finally, experiments show the proposed algorithm performs well compared to the baselines (SGD, Normalized SGD, SGD with clipping).
Strengths
- The paper's idea is pretty nice, I'm quite surprised independent normalization has not been considered in the literature before-- it's such a simple and straightforward solution to the problem of correlated samples in normalization. That said, it might be the case that it works worse in practice than normalized SGD (see the weaknesses section), but we know that normalized SGD may be nonconvergent in general. Independent samples have been used before in the estimation of the smoothness constant in ordinary (smooth) optimization [1], but not in generalized smoothness optimization as far as I can see.
- The analysis removes the requirement for large batch sizes or almost surely bounded gradient noise present in prior work, although it still requires very specific choices of (constant) batch sizes.
- The paper is written quite clearly, and is easy to read.
[1] Malitsky and Mischenko, Adaptive Gradient Descent without Descent, arXiv 1910.09529 (2020).
Weaknesses
- (Unrealistic requirement on the batch size) The batch sizes used, while constant, depend directly on the constants $\tau_1$ and $\tau_2$ in Assumption 4. These constants are in general not knowable, so how are we to choose a batch size that depends on them when we can't know them? Can Theorem 2 be generalized to hold for any arbitrary batch size instead? For example, it seems that the rate explodes if one of these constants goes to zero, which should be an easier case.
- (Unfair experimental evaluation) It seems that for I-NSGD you tune the normalization parameter, while for NSGD you do not? For a fair comparison, it should be tuned as well. The same applies to gradient clipping: the text states you did tune the learning rate for the algorithm, but not the clipping constant. That should be tuned.
- (Limited technical novelty) The technical novelty in this paper is really quite limited; save for using independent normalization, all the tools used are quite direct and similar to prior work.
Overall, despite the drawbacks, I lean towards recommending acceptance for the paper
Questions
- Please address my concerns in the weaknesses sections.
- Why is the generalized PŁ condition interesting to study? I know you do give an example in lines 172-175 but how relevant is this in practice?
Q3: The technical novelty is limited except for using independent normalization. All other tools used are quite direct and similar to prior work. Can you explain the technical novelty and technical tools more explicitly?
A: We would like to briefly compare our work with the previous works [1] and [2]. [1] is the first work to propose $(L_0,L_1)$-generalized smoothness and to study the convergence of Clipped SGD under a bounded stochastic gradient bias assumption. Later, [2] proposed a generalized smoothness condition with an additional exponent $\alpha$ and studied the convergence of SPIDER [3] under an affine bounded variance assumption; their batch size choices depend on the target accuracy, which can be impractical. Our work continues this line of research: we propose I-NSGD with a newly designed learning rate and clipping rule tailored for optimizing generalized-smooth functions under relaxed gradient variance assumptions. Moreover, our algorithm design achieves convergence with constant-level batch sizes.
We agree that independent normalization is a simple modification of the algorithm design, yet the resulting theoretical convergence analysis turns out to be quite different from, and more effective than, the existing literature. In particular, the affine bounded gradient assumption and the normalization parameter bring new challenges into the stochastic setting. To be specific, we theoretically designed a new learning rate $\gamma$ and an adaptive normalization term $h_t$ to guarantee convergence with constant-level batch sizes under generalized smoothness. For example, in the main proof of Theorem 2 (in the appendix), inequalities (36), (37), and (40) utilize the proposed learning rate and adaptive normalization factor to bound the relevant coefficients. In Lemma 6, the term induced by the generalized-smoothness condition is upper bounded by combining two bounds derived for different values of the gradient norm. These techniques and the resulting inequalities are not straightforward to obtain without rigorous mathematical reasoning.
Q: Why is the generalized PL condition interesting to study? I know you do give an example in lines 172-175 but how relevant is this in practice?
A: Great question! We have updated our related work to include more discussion of the motivation for studying the generalized PL condition. To elaborate, the PL condition initially caught researchers' attention because it relaxes the strong convexity requirement while still yielding a linear convergence rate. Recently, [4] found that PL*, a variant of the PL condition, widely exists in the loss landscape of over-parameterized neural networks; they also explain why SGD-type algorithms converge fast when training over-parameterized neural networks with the mean-squared loss. Later, [5] proposed the more general KL* condition to include a broader class of loss functions, such as the CE loss and the hinge loss, and also provide convergence guarantees for SGD. Interestingly, our generalized PL condition turns out to be a sub-class of the KL* condition when the corresponding exponent is non-negative, because in that case it can be rewritten as a lower bound on the squared gradient norm by a function of the suboptimality gap that is non-decreasing on $[0,\infty)$.
References
[1] Zhang, Jingzhao, et al. "Why gradient clipping accelerates training: A theoretical justification for adaptivity." arXiv preprint arXiv:1905.11881 (2019)
[2] Chen, Ziyi, et al. "Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization." International Conference on Machine Learning. PMLR, 2023
[3] Fang, Cong, et al. "Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator." Advances in neural information processing systems 31 (2018).
[4] Liu, Chaoyue, et al. "Loss landscapes and optimization in over-parameterized non-linear systems and neural networks." Applied and Computational Harmonic Analysis 59 (2022): 85-116.
[5] Scaman, Kevin, et al. "Convergence rates of non-convex stochastic gradient descent under a generic lojasiewicz condition and local smoothness." International conference on machine learning. PMLR, 2022.
Thank you very much for reviewing our manuscript and providing valuable feedback. Below is a response to the review questions/comments. We have revised the manuscript accordingly and uploaded a revised manuscript. Please let us know if further clarifications are needed.
Q1a: Can Theorem 2 hold for any batch size? It seems the learning rate explodes in a case that should be easier.
A: Thanks for pointing this out. We set the batch size that way just to simplify the calculations in the proof. In fact, our proof still holds for a more relaxed batch size choice, which resolves the learning-rate explosion issue. Since the constant involved is normally a moderate numerical constant, we can use any constant-level batch size. We have updated the theorem statement and its proof accordingly. In our experiments, we showed that for both the Phase Retrieval and Distributionally Robust Optimization problems, constant-level batch sizes always work well.
Q1b: In Theorem 2, the batch size is directly related to $\tau_1$ and $\tau_2$, which are generally unknown. How can we choose a batch size when we don't know $\tau_1$ and $\tau_2$?
A: As we explained in the response to Q1a, our proof still holds for a more relaxed batch size choice, which allows us to use any constant-level batch size since the constant involved is normally a moderate numerical constant. All our experiments use constant-level batch sizes.
Q2: The experiments are not entirely fair: 1. for I-NSGD you tune the normalization parameter while for NSGD you do not; 2. the clipping constant should be tuned as well. Can you report additional experimental results?
A: Yes, we have conducted more experiments per the reviewer's request. We first want to clarify that existing algorithms such as Clipped SGD and NSGD use a default normalization hyper-parameter. In our new experiments, we add NSGD with momentum and the SPIDER algorithm as baselines for comparison, and per your request, we use the same normalization hyper-parameter for all normalization-based algorithms. The training loss curve versus sample complexity and the detailed algorithm settings can be found in our updated Appendix I. For Phase Retrieval, the objective value evaluated every 100 iterations is listed below:
| Method | 100 | 200 | 300 | 400 | 500 |
|---|---|---|---|---|---|
| SGD | 4837.102334 | 1835.358158 | 1240.747515 | 991.178520 | 882.728049 |
| NSGD with momentum | 619.606991 | 20.211925 | 9.082003 | 8.980433 | 8.633882 |
| NSGD | 718.194839 | 65.204719 | 8.767983 | 8.818077 | 8.699032 |
| SPIDER | 47.629694 | 48.652767 | 36.696431 | 45.175731 | 45.658158 |
| Clipped SGD | 665.092092 | 10.367391 | 8.263630 | 8.250133 | 8.219856 |
| I-NSGD | 150.054129 | 8.492338 | 8.257874 | 8.305479 | 8.305773 |
For the DRO problem, the result recorded every 200 iterations is
| Method | 200 | 400 | 600 | 800 | 1000 |
|---|---|---|---|---|---|
| SGD | 13.980096 | 11.792903 | 10.343812 | 9.370857 | 8.550016 |
| NSGD with momentum | 6.010543 | 4.569586 | 4.068003 | 3.896011 | 3.797351 |
| NSGD | 6.421145 | 4.610153 | 4.186758 | 3.916136 | 3.789031 |
| SPIDER | 5.032559 | 3.795189 | 3.596386 | 3.540992 | 3.507888 |
| Clipped SGD | 4.633071 | 3.900307 | 3.943651 | 3.662261 | 3.642133 |
| I-NSGD | 4.429903 | 3.892592 | 3.736907 | 3.648593 | 3.615151 |
The experiments conducted on Phase Retrieval and DRO provide evidence that I-NSGD converges faster than the baseline methods SPIDER, NSGD with momentum, NSGD, and Clipped SGD.
- I understand that $\tau_1$ is a constant, but again, if I need my batch size to be at least some function of $\tau_1$, how can I find it? As far as I understand, there is no good algorithm for estimating $\tau_1$ globally.
- Thanks for the updated experiments. I would also tune the normalization parameter per algorithm rather than fixing it for all of them. That should be a fairer comparison.
- Thanks for the discussion on novelty and the additional motivation for the generalized PL condition. I believe the latter should be included in the paper.
Thanks for your reply! Regarding your concern about our batch size choice, which is related to the constant $\tau_1$ from Assumption 4, we would like to explain in more detail.
In the stochastic optimization literature, one assumes either a constant bound or an affine bound like our Assumption 4, mainly in order to upper bound the second moment of the stochastic gradient in expectation. Mathematically, the affine bound can be expressed as
$$
||\nabla f_{\xi}(w_t)-\nabla F(w_t)|| \leq \tau_1 ||\nabla F(w_t)|| + \tau_2,
$$
where $||\nabla F(w_t)||$ is the norm of the full gradient.
The motivation is that, in stochastic optimization, the theoretical evaluation metric for an algorithm is stated in terms of the full gradient $\nabla F(w_t)$ rather than the stochastic estimator $\nabla f_{\xi}(w_t)$, which varies from trial to trial. To eliminate the effect of randomness, assumptions such as a bounded or affine-bounded estimation bias, or a bounded or affine-bounded variance, are necessary when proving convergence in terms of the full gradient. How these assumptions are used can vary depending on the specific algorithm structure.
Coming back to the batch size derived from Assumption 4, we agree that there is no intuitive way to estimate $\tau_1$. For low-dimensional (1D, 2D) data, exploratory data analysis might help determine its value accurately. In practice, however, by tuning the batch size over 16, 32, 64, and so on, users can find an appropriate value. Moreover, in a standard deep learning training pipeline, data pre-processing that includes normalization (e.g., normalized image matrices or word embeddings) keeps $\tau_1$ from being a huge constant. As we have demonstrated in the Phase Retrieval, DRO, and ResNet training experiments (see Appendix I), a modest constant batch size is enough to guarantee convergence for most problems using I-NSGD.
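In practice such a sweep is just a small loop; a minimal sketch is shown below (the helper `train_insgd` and its return value are placeholders for the user's own training routine, not code from the paper):

```python
def pick_batch_size(train_insgd, candidates=(16, 32, 64, 128), budget_epochs=5):
    """Run short trial trainings and return the batch size with the lowest loss.

    train_insgd(batch_size, epochs) is a user-supplied routine that trains
    I-NSGD for a few epochs and returns the final training loss.
    """
    losses = {b: train_insgd(batch_size=b, epochs=budget_epochs) for b in candidates}
    return min(losses, key=losses.get)
```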
Previous works such as [1], [2], [3], and [4] either require the batch size of their stochastic algorithms to depend on the target accuracy $\epsilon$ (growing as $\epsilon$ decreases) to guarantee convergence under a relaxed variance assumption, or they require a strong assumption such as almost surely bounded gradient noise to guarantee convergence with an $O(1)$ batch size [note: $\epsilon$ represents the desired algorithm accuracy, which is a small constant in $(0,1)$]. Our work balances the assumptions and the resulting batch size: the newly proposed I-NSGD algorithm converges with a constant-level batch size under the relaxed gradient bias assumption $||\nabla f_{\xi}(w_t)-\nabla F(w_t)|| \leq \tau_1||\nabla F(w_t)||+\tau_2$.
References
[1] Chen, Ziyi, et al. "Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization." International Conference on Machine Learning. PMLR, 2023
[2] Reisizadeh, A., Li, H., Das, S., et al. Variance-reduced clipping for non-convex optimization. arXiv preprint arXiv:2303.00883, 2023.
[3] Zhang, Jingzhao, et al. "Why gradient clipping accelerates training: A theoretical justification for adaptivity." arXiv preprint arXiv:1905.11881 (2019)
[4] Zhang, B., Jin, J., Fang, C., et al. Improved analysis of clipping algorithms for non-convex optimization. Advances in Neural Information Processing Systems, 2020, 33: 15511-15521.
This paper examines nonconvex optimization under the well-known generalized smoothness condition and an extended Polyak-Łojasiewicz (PL) condition. Under these relaxed assumptions, the authors establish the convergence rate of the Normalized Gradient Descent (NGD) method in both the deterministic and stochastic regimes. In particular, for the stochastic setting, they address the limitations of NGD by using independent samples to compute the direction and the normalization factor of the stochastic gradient. This method, referred to as Independently-Normalized SGD (I-NSGD), provably converges at the claimed rate.
Strengths
Overall, this is a solid and well-written paper. The proofs have a good presentation and appear to be sound based on my reading. The rates are novel under the generalized PL condition. In particular, the idea of using different samples for the direction and scale of the normalized stochastic gradient estimator seems to be an interesting trick and might find its application in similar situations.
Weaknesses
Nevertheless, the major concern comes from the novelty and contribution perspective. This paper is clearly an extension of [Chen et al., 2023] and establishes new rates under the generalized PL condition. While some rates are indeed novel, no results really outperform the known results in comparable settings. Moreover, the PL condition seems to greatly restrict the universality of these rates: although PL does not imply convexity, it allows the objective to have favorable geometric properties that lead to fast rates, sometimes even comparable to the convex or strongly convex setting. This makes the analysis under PL less general and less interesting.
A more serious concern is about the stochastic regime. The authors argue the analysis of I-NSGD outperforms [Zhang et al., 2021] by using an affine noise bound, as in Assumption 4. Nevertheless, similar assumptions have already been explored in other prior works like [1], [2], and [3], where convergence rates are established for adaptive gradient methods under the same generalized smoothness assumption. This reduces the contribution of this work, as the above results already achieve the optimal rate under similar conditions. A more thorough discussion of the connection to the literature would help readers gain a more complete understanding of the whole picture.
As a result, it seems that the current paper does not bring a lot of novel insights to the optimization community. I tend to reject this paper unless they are able to provide a more convincing argument to defend the broader impact of this work.
References
[1] Bohan Wang, Huishuai Zhang, Zhiming Ma, and Wei Chen. Convergence of AdaGrad for non-convex objectives: simple proofs and relaxed assumptions. COLT 2023.
[2] Matthew Faw, Litu Rout, Constantine Caramanis, and Sanjay Shakkottai. Beyond uniform smoothness: a stopped analysis of adaptive SGD. COLT 2023.
[3] Yusu Hong, Junhong Lin. Revisiting Convergence of AdaGrad with Relaxed Assumptions.
Questions
There are no further questions.
References
[1] Wang, Bohan, et al. Convergence of AdaGrad for non-convex objectives: simple proofs and relaxed assumptions. COLT 2023.
[2] Faw, Matthew, et al. Beyond uniform smoothness: a stopped analysis of adaptive SGD. COLT 2023.
[3] Hong, Yusu, et al. Revisiting Convergence of AdaGrad with Relaxed Assumptions.
[4] Arjevani, Yossi, et al. Lower bounds for non-convex stochastic optimization.
[5] Liu, Chaoyue, et al. "Loss landscapes and optimization in over-parameterized non-linear systems and neural networks." Applied and Computational Harmonic Analysis 59 (2022): 85-116.
[6] Scaman, Kevin, et al. "Convergence rates of non-convex stochastic gradient descent under a generic lojasiewicz condition and local smoothness." International conference on machine learning. PMLR, 2022.
[7] Zhang, Jingzhao, et al. "Why gradient clipping accelerates training: A theoretical justification for adaptivity." arXiv preprint arXiv:1905.11881 (2019).
[8] Kingma, Diederik et al. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
Thank you very much for reviewing our manuscript and providing valuable feedback. Below is a response to the review questions/comments. We have revised the manuscript accordingly and uploaded a revised manuscript. Please let us know if further clarifications are needed.
Q1: The PL condition greatly restricts the universality of these rates. Although PL does not imply convexity, it allows the objective to have favorable geometric properties that lead to fast rates, sometimes even comparable to the convex or strongly convex setting. Can the authors explain the value of introducing the generalized PL condition?
A: Great question! We have updated our related work to discuss the motivation for studying the generalized PL condition in more detail. Initially, the PL condition attracted researchers' attention because it relaxes the strong convexity requirement. Recently, from the deep learning perspective, [5] found that PL* -- a variant of the PL condition -- widely exists in the loss landscape of over-parameterized neural networks; they also explain why SGD-type algorithms converge fast when training over-parameterized neural networks with the mean-squared loss. Later, [6] further proposed the more general KL* condition to include more loss functions, such as the CE loss and the hinge loss, and also provide convergence guarantees for SGD. Interestingly, it turns out that the generalized PL condition is a sub-class of the KL* condition when the corresponding exponent is non-negative, because in that case it can be rewritten as a lower bound on the squared gradient norm by a function of the suboptimality gap that is non-decreasing on $[0,\infty)$. Based on these observations, we are motivated to analyze the convergence of stochastic algorithms under the generalized PL condition.
Q2: The authors argue that the analysis of I-NSGD outperforms [7] by using the affine noise bound stated in Assumption 4. Nevertheless, similar assumptions have already been explored in other prior works like [1], [2], and [3], where convergence rates are established for adaptive gradient methods under the same generalized smoothness assumption. This reduces the contribution of this work, as the above results already achieve the optimal rate under similar conditions. Can the authors add a more thorough discussion of the connection to the literature to help readers gain a more complete understanding of the whole picture?
A: Thanks for sharing these previous works. We have added a discussion in the related work to compare with these works, and we summarize it below for your reference. We agree that both our work and the works [1-3] achieve the optimal rate under similar generalized-smoothness conditions. While [1-3] focus on analyzing the existing adaptive-type algorithms, our work aims to propose a new and fundamental independent sampling scheme that can inspire new stochastic algorithm designs and hence is complementary to the existing studies on algorithms.
Discussion: The references [1], [2], [3] focus on analyzing existing algorithms from the AdaGrad family. The update rule of AdaGrad leverages momentum and adaptive gradients, and hence is substantially different from basic normalized SGD and clipped SGD: AdaGrad normalizes by an accumulation of all past stochastic gradients, whereas normalized SGD and clipped SGD rescale only the current stochastic gradient by its norm or by a clipping threshold. This accumulation term leads to a very different convergence analysis.
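For concreteness, the standard textbook forms of these update rules (not necessarily with the exact hyper-parameters used in the paper) are
$$
\begin{aligned}
&\text{AdaGrad-Norm:} && b_t^2 = b_{t-1}^2 + ||g_t||^2, \quad w_{t+1} = w_t - \frac{\gamma}{b_t}\, g_t,\\
&\text{NSGD:} && w_{t+1} = w_t - \gamma\,\frac{g_t}{||g_t|| + \delta},\\
&\text{Clipped SGD:} && w_{t+1} = w_t - \gamma \min\Big\{1, \frac{c}{||g_t||}\Big\}\, g_t,
\end{aligned}
$$
where $g_t$ is the stochastic gradient at $w_t$ and $\delta$, $c$ are constants; only AdaGrad's scaling $b_t$ depends on the entire gradient history.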
Instead of focusing on the existing algorithms, our work aims to develop a new and fundamental sampling scheme, and further apply it to Clipped SGD to solve generalized-smooth problems. Based on this new sampling scheme, we are able to relax the constant upper bound developed in [7] for controlling the stochastic gradient bias and achieve a comparable convergence rate with constant-level batch sizes. To elaborate on the novelty of our independent sampling, as illustrated in Figure 1, normalized SGD/clipped SGD without independent sampling take biased updates in expectation. By leveraging independent sampling, our proposed I-NSGD can reduce bias in expectation, especially when the stochastic gradients have diverse directions. Furthermore, in the design of our I-NSGD, we specifically adapt the hyper-parameters to the generalized smoothness and affine stochastic gradient noise to improve the analysis under independent sampling. As a result, the technical proof of I-NSGD is substantially different from that of the existing algorithms.
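To make the independent-sampling idea concrete, the following is a minimal Python sketch of one I-NSGD-style step. It is an illustration only, not the exact Algorithm 1 of the paper: the normalization factor borrows the form of $h_t$ quoted earlier in this thread, while the precise learning rate, clipping rule, and constants in the paper may differ.

```python
import numpy as np

def insgd_step(w, grad_oracle, gamma, beta, L1, delta, rng):
    """One independently-normalized SGD step (illustrative sketch only).

    grad_oracle(w, rng) should return a stochastic gradient of F at w
    computed on a freshly drawn mini-batch, so that two calls yield two
    independent mini-batches B and B'.
    """
    g = grad_oracle(w, rng)        # update direction from mini-batch B
    g_prime = grad_oracle(w, rng)  # independent mini-batch B', used only for scaling
    # Normalization factor h built from the independent batch B'
    # (form borrowed from the h_t expression quoted in this discussion).
    h = max(1.0, (4.0 * L1 * gamma) ** (1.0 / beta)
            * (2.0 * np.linalg.norm(g_prime) + delta))
    # Because g and h come from independent batches, conditioned on w we have
    # E[g / h**beta] = E[g] * E[1 / h**beta], i.e. the direction stays unbiased
    # up to a deterministic scalar factor.
    return w - gamma * g / (h ** beta)
```

Here `grad_oracle` and `rng` are placeholders for the user's own mini-batch gradient routine and random source; the only point of the sketch is that the direction $g$ and the scale $h$ are computed from independent mini-batches, which decouples their expectations.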
This paper considers the problem of nonconvex optimization under the $(L_0,L_1)$-smoothness condition (also known as generalized smoothness) and specifically considers two problems: (a) adapting to the generalized Polyak-Łojasiewicz (PŁ) condition, a generalization of strong convexity, when it holds, and (b) solving the stochastic problem, with or without the PŁ condition. For the former problem, the authors show that the normalized gradient descent method can achieve different (faster) rates depending on the PŁ condition, recovering linear convergence in the classical case (standard smoothness and the standard PŁ condition). For the stochastic problem, the authors consider a variant of normalized gradient descent where the normalization is done with a different, independently sampled point. Theorem 2 shows that this method can achieve the stated convergence rate (to achieve error $\epsilon$) without the need for large ($\epsilon$-dependent) batch sizes or assuming that the stochastic gradient noise norm is bounded almost surely. Finally, experiments show the proposed algorithm performs well compared to the baselines (SGD, Normalized SGD, SGD with clipping).
The paper received 4 reviews, with scores 6, 5, 3, and 3, so three out of four reviewers recommend rejection. I have read the reviews, the rebuttals, and the ensuing discussion. I have decided to ignore several pieces of criticism, such as the "physical units" criticism by one of the reviewers; some other comments by the same reviewer were valid, however. I found the replies by the authors satisfactory. Overall, I like the paper, but I believe it is a bit short of what I would expect from a theoretical ICLR paper.
My own view is mostly aligned with that of the positive reviewer:
- The paper's idea is nice, it's surprising that independent normalization has not been considered in the literature before -- it's such a simple and straightforward solution to the problem of correlated samples in normalization.
- That said, it might be the case that it works worse in practice than normalized SGD (see the weaknesses section), but we know that normalized SGD may be nonconvergent in general. Independent samples have been used before in the estimation of the smoothness constant in ordinary (smooth) optimization [1], but not in generalized smoothness optimization as far as I can see.
- The analysis removes the requirement for large batch sizes or almost surely bounded gradient noise present in prior work, although it still requires very specific choices of (constant) batch sizes.
- The paper is written quite clearly, and is easy to read.
The main issues include:
- Unrealistic requirement on the batch size
- Unfair experimental evaluation
- Limited technical novelty
- Add more emphasis on the novelty of the algorithm and not necessarily on the complexity results for the problem class considered, since other equally good or better methods exist for that (e.g., incorporate your reply to the reviewer who raised these concerns into the main body of the paper)
I believe that, after a major revision (focusing on the key meaningful criticism in the reviews), the paper would be in a good enough shape to be acceptable to a top AI/ML conference. I encourage the authors to perform such as revision.
Additional Comments from the Reviewer Discussion
I am ignoring the criticism raised from one reviewer about physical units. While the discussion was interesting, I believe that it unnecessarily side-tracked the focus on aspects of the paper that are not central to the main contributions. Also, I do not agree that physical units analysis is a standard tool in optimization.
Reject