Finding Second-order Stationary Points for Generalized-Smooth Nonconvex Minimax Optimization via Gradient-based Algorithm
Abstract
Reviews and Discussion
In this paper, the authors consider nonconvex minimax problems. Specifically, the paper proposes a gradient-based accelerated method for finding a second-order stationary point of nonconvex-strongly-concave minimax optimization under the so-called generalized smoothness condition. This can be regarded as an extension of the existing work of Luo et al. (2022), Huang et al. (2022), and Yang et al. (2023), which assume the Lipschitz Hessian condition. The theoretical comparison is listed clearly in Table 1.
Strengths
The paper addresses an important theoretical question on designing efficient optimization algorithms for finding second-order stationary points of nonconvex minimax problems under the generalized smoothness condition. It extends the existing work to more general cases, which can arise in various machine learning problems.
Weaknesses
- While the extension to the general smooth case is interesting and important, the design of the optimization algorithm relies on combining Nesterov's classical Accelerated Gradient Descent with tricks from Li and Lin (2022). Could the authors comment on the main novelty of the algorithmic design compared with the existing literature?
- The proofs seem to heavily depend on the previous works Luo et al. (2022), Huang et al. (2022) and Yang et al. (2023), which assume the Lipschitz Hessian condition. The proof novelty may be incremental (see my questions below).
- The computational gain from the experiment compared with RAHGD and Clip RAHGD seems limited.
Questions
- The authors summarised the main challenges in Section 4.1. It would be helpful if the authors described in the proof sketch which lemmas are the main novel ones.
- The formulation of the Domain-Adversarial Neural Network given by (9) does not seem right. There are no labels for the target domain, yet the loss function contains a term for the target domain. Am I missing something here?
Q3: The computational gain from the experiment compared with RAHGD and Clip RAHGD seems limited.
A3: In the experiments, the test accuracy for adaptation from both SVHN and MNIST-M to MNIST shows that our proposed ANCGDA algorithm converges faster and achieves better convergence results than RAHGD [3] and Clip RAHGD on these two small 28x28 image datasets. The experimental results are consistent with the theoretical complexity gap between ANCGDA and RAHGD, which implies that the performance gap can be more pronounced on real-world datasets with large dimensions. We will conduct more experiments on large-scale datasets in our future work.
Q4: The formulation of the Domain-Adversarial Neural Network given by (9) does not seem right. There are no labels for the target domain, yet the loss function contains a term for the target domain. Am I missing something here?
A4: Sorry for omitting the details. We give the missing details as follows; they will be added to our revised paper.
The target domain dataset only contains features. The feature extractor is a single-layer neural network of size (28 × 28) → 200 with parameter , and the domain classifier is a two-layer neural network of size 200 → 20 → 10 with parameter , followed by a cross-entropy loss. For the logistic loss functions for , we let and .
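For concreteness, a minimal PyTorch-style sketch of this architecture is given below; the ReLU activations, module names, and the random example batch are our own illustrative assumptions rather than a verbatim excerpt of our implementation.

```python
import torch
import torch.nn as nn

# Feature extractor: a single linear layer, (28*28) -> 200 (parameter "theta").
feature_extractor = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 200), nn.ReLU())

# Domain classifier: two linear layers, 200 -> 20 -> 10 (parameter "phi"),
# followed by a cross-entropy loss.
domain_classifier = nn.Sequential(nn.Linear(200, 20), nn.ReLU(), nn.Linear(20, 10))
cross_entropy = nn.CrossEntropyLoss()

def domain_loss(images, domain_labels):
    # images: a batch of 28x28 inputs; domain_labels: integer class labels.
    logits = domain_classifier(feature_extractor(images))
    return cross_entropy(logits, domain_labels)

# Example forward pass with random data (batch of 32 single-channel 28x28 images).
loss = domain_loss(torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,)))
```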
Reference:
[1] Luo L, Li Y, Chen C. Finding second-order stationary points in nonconvex-strongly-concave minimax optimization[J]. Advances in Neural Information Processing Systems, 2022, 35: 36667-36679.
[2] Huang M, Chen X, Ji K, et al. Efficiently escaping saddle points in bilevel optimization[J]. arXiv preprint arXiv:2202.03684, 2022.
[3] Yang H, Luo L, Li C J, et al. Accelerating inexact hypergradient descent for bilevel optimization[J]. arXiv preprint arXiv:2307.00126, 2023.
Thank you for the detailed response. I keep my score.
Thank you very much for your constructive comments and suggestions!
Q1: While the extension to the general smooth case is interesting and important, the design of the optimization algorithm relies on combining Nesterov's classical Accelerated Gradient Descent with tricks from Li and Lin (2022). Could the authors comment on the main novelty of the algorithmic design compared with the existing literature?
A1: Thank you for your valuable comment. After sufficient Nesterov's Accelerated Gradient Descent steps are performed with the restart technique to ensure that the function value of decreases monotonically, the negative curvature finding procedure, which is inspired by the idea of escaping saddle points, is performed; this is the main novelty of the algorithmic design.
For the negative curvature finding procedure, we define a function in Eq. (80) such that , as shown in Appendix D, and estimate the derivative of with , which is attained in Line 15, and in the following epoch with iterations. Here, instead of the classical Nesterov's Accelerated Gradient Descent steps in the epochs with , the algorithm first obtains (represented as in Line 5) with and then updates in Line 19 by for every iteration in this epoch, where the algorithm tries to find a negative curvature direction around within a radius of . Then, after iterations, Lemma 4.7 implies that the unit vector is a negative curvature direction with high probability, and the value of the function decreases by the one-step descent along this direction in Line 22, again according to Lemma 4.7. Further details of the proof can be found in Appendix D.
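As an illustration of the general idea (not our Algorithm 1 itself), the following minimal sketch finds a negative curvature direction using only gradient evaluations, via a finite-difference power iteration on $\ell I - \nabla^2 f(\mathbf{x})$ in the spirit of Xu et al. (2018); all parameter names and step sizes here are hypothetical.

```python
import numpy as np

def hvp_fd(grad, x, u, r=1e-4):
    # Finite-difference Hessian-vector product: (grad(x + r*u) - grad(x)) / r.
    return (grad(x + r * u) - grad(x)) / r

def negative_curvature_direction(grad, x, ell, T=200, r=1e-4, seed=0):
    # Power iteration on (ell*I - H), H = Hessian of f at x, using only gradients.
    # If H has a negative eigenvalue, the iterate aligns (with high probability
    # over the random start) with a negative-curvature direction.
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)
    for _ in range(T):
        v = ell * u - hvp_fd(grad, x, u, r)
        u = v / np.linalg.norm(v)
    return u, float(u @ hvp_fd(grad, x, u, r))  # direction and Rayleigh quotient u^T H u

def one_step_escape(f, x, u, eta):
    # One-step descent along +/- u (the sign returned by power iteration is arbitrary).
    return min((x - eta * u, x + eta * u), key=f)

# Toy usage on f(x) = x1^2 - x2^2, which has a saddle at the origin.
f = lambda x: x[0] ** 2 - x[1] ** 2
grad = lambda x: np.array([2.0 * x[0], -2.0 * x[1]])
u, curv = negative_curvature_direction(grad, np.zeros(2), ell=2.0)
print(curv)                                   # close to -2, the negative Hessian eigenvalue
print(one_step_escape(f, np.zeros(2), u, eta=0.5))
```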
Q2: The proofs seem to heavily depend on the previous works Luo et al. (2022), Huang et al. (2022) and Yang et al. (2023), which assume the Lipschitz Hessian condition. The proof novelty may be incremental. The authors summarised the main challenges in Section 4.1. It would be helpful if the authors described in the proof sketch which lemmas are the main novel ones.
A2: Thank you for your constructive comment. We will add the following discussion to the revised paper.
Firstly, the main novelty compared with Luo et al. (2022) [1], Huang et al. (2022) [2] and Yang et al. (2023) [3] is that we propose a generalized Lipschitz Hessian assumption for the objective function of minimax optimization in Definition 3.4, which is much weaker than the classical Hessian Lipschitz assumption used in previous works. Under this assumption we further establish Lemma 4.2, a key technical lemma which proves that the primal function and are still well-defined under the relaxed smoothness condition on , and which analyses the structure and the generalized smoothness properties of and in terms of the generalized smoothness constants of and . This is significant because, compared with Huang et al. (2022) [2] and Yang et al. (2023) [3], the analysis of the error term of the hypergradient estimator becomes much more difficult under generalized smoothness: its upper bound depends on . This quantity is difficult to handle because can be large, and it is difficult to decouple the two measurable terms and .
Also, in Lemmas 4.7 and 4.8, we introduce two important lemmas for escaping saddle points via the negative curvature direction finding technique, which guarantee that the algorithm can find second-order stationary points with high probability under the generalized Hessian continuity assumption in . Compared with the results in Yang et al. (2023) [3], this improves the complexity by even under the much weaker assumptions.
We have conducted three numerical experiments to compare the performance of our algorithm with PRAHGD in terms of the efficiency of escaping saddle points on . The table below gives the number of iterations each algorithm needs so that 90% of the samples achieve a descent of in the value of , for , and , respectively. All three functions have a saddle point at . Each experiment is conducted with the same and for 300 samples generated in a Gaussian ball around . We will add these experiments to the final version of the paper (a sketch of the evaluation protocol follows the table).
| Method | (Iterations) | (Iterations) | (Iterations) |
|---|---|---|---|
| PRAHGD | 45 | 72 | 24 |
| ANCGDA (Ours) | 19 | 22 | 6 |
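For completeness, the sketch below illustrates the evaluation protocol described above; the toy objective $f(x) = x_1^2 - x_2^2$ and plain gradient descent are hypothetical stand-ins for our three test functions and for the PRAHGD/ANCGDA iterations, which are not reproduced here.

```python
import numpy as np

# Hypothetical stand-ins: a toy saddle objective and plain gradient descent
# replace the paper's test functions and the PRAHGD/ANCGDA iterations.
f = lambda x: x[0] ** 2 - x[1] ** 2
grad = lambda x: np.array([2.0 * x[0], -2.0 * x[1]])

def iterations_for_90pct_descent(step, delta, n_samples=300, radius=1e-3,
                                 max_iters=10_000, seed=0):
    rng = np.random.default_rng(seed)
    # Starting points sampled in a small Gaussian ball around the saddle at 0.
    starts = radius * rng.standard_normal((n_samples, 2))
    hitting_times = np.full(n_samples, max_iters)
    f_saddle = f(np.zeros(2))
    for i, x in enumerate(starts):
        for t in range(max_iters):
            if f(x) <= f_saddle - delta:      # reached the target descent
                hitting_times[i] = t
                break
            x = x - step * grad(x)
    # Iteration count by which 90% of the samples have reached the target descent.
    return int(np.quantile(hitting_times, 0.9))

print(iterations_for_90pct_descent(step=0.1, delta=0.5))
```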
This paper provides the first first-order algorithm for non-stochastic generalized-smooth nonconvex-strongly-concave minimax optimization, achieving an iteration complexity even better than the SOTA for the Lipschitz-smooth setting.
Strengths
As shown in my summary, the claim of both a weaker smoothness assumption and a stronger result (lower iteration complexity) is significant.
Weaknesses
Some points need to be clarified, as shown below, especially question 1 about the intuition behind the algorithm design. A possible proof error may invalidate the main convergence result, as described in question 14 below.
Questions
(1) In line 5 of Algorithm 1, why can help get rid of saddle points? At the end of Algorithm 1, why is the NC direction? In the revised paper, could you provide intuitive explanations for these questions or cite the works that contain such explanations?
(2) In line 88, by "Speically", do you mean "Specially" or "Specifically"?
(3) In line 103, do you mean to change to ?
(4) In line 216, "with Eq. (2)".
(5) In Eq. (5) of Lemma 4.2, can be simplified to ?
(6) In line 320, "reset to be an epoch" might be read together. Do you mean "we define an epoch to be a round from to the iteration that triggers one of these conditions and resets k to 0"?
(7) In line 325, does "hypergradient estimation of " mean "estimation of the hypergradient "?
(8) In line 331: "an second-order" to "a second-order".
(9) You could define around its first appearance.
(10) In Lemma 4.2, depends on but the conclusions do not depend on , why? Can be any arbitrary point? Should we add to the end of the equation of item 4, based on Eq. (36)? Adding this also will make the conclusion consistent with the inequality in Definition 3.2 which contains .
(11) and are defined but not used in Lemmas 4.6 and 4.7.
(12) Right under Eq. (8), what are the roles of and (for example, are they the predictor and the loss function, respectively)? In Eq. (9), what are , and ?
(13) Some other hyperparameter choices like could be revealed in your experiment for reproducibility.
(14) Possible error in the proof of Lemma 4.6: Should the second be ? Since their distance is not guaranteed to be smaller than , I am not sure whether Lemma 4.6 holds.
Thanks for your constructive comments; we have revised our paper accordingly! We address the concerns raised in your questions below.
Q1: In line 5 of Algorithm 1, why can help get rid of saddle points? At the end of Algorithm 1, why is the NC direction? In the revised paper, could you provide intuitive explanations for these questions or cite the works that contain such explanations?
A1: Sorry for omitting the details. We provide a more detailed analysis as follows, which has been added to our revised paper.
Firstly, to analyse the procedure of escaping saddle points after sufficient descent for finding first-order stationary points, we define a function in Eq. (80) such that (assuming, by shifting so that is mapped to , for convenience and without loss of generality). Then the derivative of the function is . So, for the iteration at which the uniform perturbation is added, we can estimate the derivative of with and to update in the escaping-saddle-point procedure, and the estimation error can be bounded by according to Lemma 4.4. Further details of can be found in Appendix D.
Then, for the unit vector , Lemma 4.7 yields that for a point satisfying , after iterations of the negative curvature (NC) finding procedure, with probability the obtained satisfies , where denotes the Hessian matrix of the function . Here, we take the definition of a negative curvature direction from Xu et al. (2018) [1], which implies that for a non-degenerate saddle point of a function with and , the negative curvature direction satisfies and . Taking and yields that the obtained is an NC direction. Further proofs of Lemmas 4.6 and 4.7 can be found in Appendix D, and the NC finding procedure is inspired by Xu et al. (2018) [1] and Zhang et al. (2021) [2].
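For intuition, the standard second-order Taylor argument behind the one-step descent along an NC direction can be written as follows (with generic constants; this is not the exact bound of our Lemma 4.7):

```latex
% A unit vector u is a negative-curvature direction of f at x if
%   u^\top \nabla^2 f(x)\, u \le -\delta   for some \delta > 0.
% Choosing the sign of u so that \nabla f(x)^\top u \le 0 and stepping a
% distance \eta along u gives, by a second-order Taylor expansion,
\[
  f(x + \eta u)
  \;\le\; f(x) + \eta\,\nabla f(x)^\top u
          + \tfrac{\eta^2}{2}\, u^\top \nabla^2 f(x)\, u
          + \text{higher-order terms}
  \;\le\; f(x) - \tfrac{\delta}{2}\,\eta^2 + \text{higher-order terms},
\]
% so near a saddle point, where the gradient is already small, the negative
% quadratic term dominates and the function value strictly decreases.
```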
Q2, 3, 4, 6, 7, 8: There are some typos in the paper. Also, some expressions are improper and can be misread.
A: Thank you for catching the typos! We have corrected them in the revised version.
Q5, 9: In Eq. (5) of Lemma 4.2, can be simplified to ? Also, you could define around its first appearance.
A: Thank you for your suggestion! We have simplified the definition of and added the definition of in Lemma 4.2.
Reference:
[1] Xu Y, Jin R, Yang T. First-order stochastic algorithms for escaping from saddle points in almost linear time[J]. Advances in neural information processing systems, 2018, 31.
[2] Zhang C, Li T. Escape saddle points by a simple gradient-descent based algorithm[J]. Advances in Neural Information Processing Systems, 2021, 34: 8545-8556.
Q10: In Lemma 4.2, depends on but the conclusions do not depend on , why? Can be any arbitrary point? Should we add to the end of the equation of item 4, based on Eq. (36)? Adding this also will make the conclusion consistent with the inequality in Definition 3.2 which contains .
A10: Thanks for your constructive comments on Lemma 4.2. We apologize for the confusion. Here can be any arbitrary point, as and $\mathbf{x}'$ satisfy the generalized smoothness assumptions, and Eq. (6) was presented improperly. We have corrected it with , where denotes the Euclidean ball of radius R centered at x, which is consistent with the subsequent proof and conclusions. We have also added the missing term to item 4. Thanks again for pointing out this mistake.
Q11: and are defined but not used in Lemmas 4.6 and 4.7.
A11: Thank you for your suggestion. We removed the definition of from Lemma 4.7 and added it to the proof in Appendix D. We have also rewritten Lemma 4.6 as follows:
Lemma 4.6. Run Algorithm 1 with the parameter settings in Theorem 4.3. In the epoch in which the condition on Line 11 is triggered, the point in Line 13 satisfies .
Q12, 13: About the experiments. Right under Eq. (8), what are the roles of and (for example, are they the predictor and the loss function, respectively)? In Eq. (9), what are , and ? Also, some hyperparameter choices such as , , , could be revealed in your experiments for reproducibility.
A: Sorry for omitting the details. We give the missing details as follows; they have been added to our revised paper.
Here, the feature extractor is a single-layer neural network of size (28 × 28) → 200 with parameter , and the domain classifier is a two-layer neural network of size 200 → 20 → 10 with parameter , followed by a cross-entropy loss. For the logistic loss functions for , we let and .
Regarding the other hyperparameters for ANCGDA, RAHGD and Clipping RAHGD, we choose , and for both source domain datasets, while setting for SVHN as the source dataset and for MNIST-M.
Q14: Possible error in the proof of Lemma 4.6: Should the second be ? Since their distance is not guaranteed to be smaller than , I am not sure whether Lemma 4.6 holds.
A14: We apologize for the confusion. In the algorithm, we actually reset at the end of each epoch, so the beginning of the proof of Lemma 4.6 should be as follows:
where holds because . Hence there is no term in Eq. (75), and the proof shows that Lemma 4.6 holds properly.
I'm satisfied with the response and increased my rating to 6.
By the way, Eq. (6) could be changed to with big parentheses and .
Thanks for your efforts for elaboration and improvements.
Thank you very much for your kind reply! We have changed Eq. (6) in the revised paper according to your suggestions.
This paper studies nonconvex-strongly-concave minimax problems under the generalized-smooth setting and proposes an ANCGDA method to find a second-order stationary point in nonconvex-strongly-concave minimax optimization with generalized smoothness. It provides a convergence analysis of the proposed method under certain conditions. It also provides some numerical experimental results on the proposed method.
Strengths
This paper proposes an ANCGDA method to find a second-order stationary point in nonconvex-strongly-concave minimax optimization with generalized smoothness. It provides a convergence analysis of the proposed method under certain conditions. It also provides some numerical experimental results on the proposed method.
Weaknesses
From Algorithm 1 of this paper, the proposed algorithm basically extends the existing method in [1] to minimax optimization. The novelty of the proposed algorithm is limited. Meanwhile, the convergence analysis mainly follows the existing results.
[1] Huan Li and Zhouchen Lin. Restarted nonconvex accelerated gradient descent: No more polylogarithmic factor in the complexity. In International Conference on Machine Learning, pp. 12901–12916. PMLR, 2022.
The numerical experiments in the paper are limited. The authors should add some numerical experiments such as GAN and multi-agent reinforcement learning.
Questions
From Figure 1, what does the horizontal axis represent?
Under Definition 3.4, does the generalized Hessian continuity condition truly generalize the smoothness condition? If not, please give some negative examples.
What is the full name of ANCGDA in the paper?
About the contribution.
We further clarify the contributions of the proposed method as follows.
Weaker assumptions. We first propose a second-order generalized smoothness condition for minimax optimization, as shown in Definition 3.4, and establish new fundamental properties of the primal function and in Lemma 4.2 under this condition, which is much weaker than the classical Lipschitz smoothness and Lipschitz Hessian continuity assumptions commonly used in previous works on both nonconvex optimization and minimax optimization, such as [1], [2], [3] and [4]. It extends the existing work to more general cases, which can arise in various machine learning problems. To the best of our knowledge, this is the first work to show convergence to second-order stationary points for nonconvex minimax optimization with generalized smoothness.
Lower complexity. We achieve an iteration complexity of for finding a second-order stationary point in generalized-smooth nonconvex-strongly-concave minimax optimization, which is even better than the SOTA for the Lipschitz-smooth nonconvex-strongly-concave setting.
Better experimental results. We conduct a numerical experiment on a domain adaptation task to validate the practical performance of our method. We show that ANCGDA consistently outperforms other minimax optimization algorithms.
Q2: What is the full name of ANCGDA in the paper? From Figure 1, what does the horizontal axis represent?
A2: Sorry for omitting the details. We have added the full name of the algorithm (Accelerated Negative Curvature Gradient Descent Ascent) to the title of Algorithm 1. The horizontal axis represents iterations; we have added this to Figure 1 in the revised paper.
Q3: Under Definition 3.4, does the generalized Hessian continuity condition truly generalize the smoothness condition? If not, please give some negative examples.
A3: According to Xie et al. (2024) [5], the second-order generalized smoothness of nonconvex optimization can be interpreted from the perspective of the boundedness of higher-order derivatives, just as the original definition of first-order generalized smoothness comes from the boundedness of second-order derivatives in Zhang et al. (2019) [6]. We agree with the explanations in the cited works.
Q4: The numerical experiments in the paper are limited. The authors should add some numerical experiments such as GAN and multi-agent reinforcement learning.
A4: Thank you for your suggestion! In the domain adaptation experiments, the test accuracy for adaptation from both SVHN and MNIST-M to MNIST shows that our proposed ANCGDA algorithm converges faster and achieves better convergence results than RAHGD [3] and Clip RAHGD on these two small 28x28 image datasets. The experimental results are consistent with the theoretical complexity gap between ANCGDA and RAHGD, which implies that the performance gap can be more pronounced on real-world datasets with large dimensions. We will conduct more experiments on large-scale datasets in our future work.
Reference:
[1] Huan Li and Zhouchen Lin. Restarted nonconvex accelerated gradient descent: No more polylogarithmic factor in the complexity. In International Conference on Machine Learning, pp. 12901–12916. PMLR, 2022.
[2] Luo L, Li Y, Chen C. Finding second-order stationary points in nonconvex-strongly-concave minimax optimization[J]. Advances in Neural Information Processing Systems, 2022, 35: 36667-36679.
[3] Huang M, Chen X, Ji K, et al. Efficiently escaping saddle points in bilevel optimization[J]. arXiv preprint arXiv:2202.03684, 2022.
[4] Yang H, Luo L, Li C J, et al. Accelerating inexact hypergradient descent for bilevel optimization[J]. arXiv preprint arXiv:2307.00126, 2023.
[5] Xie C, Li C, Zhang C, et al. Trust Region Methods for Nonconvex Stochastic Optimization beyond Lipschitz Smoothness[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2024, 38(14): 16049-16057.
[6] Zhang J, He T, Sra S, et al. Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity[C]//International Conference on Learning Representations. 2019.
The authors basically did not address my concerns. I still have concerns about the novelty of this paper. Meanwhile, the authors did not add any numerical experiments.
Definition 3.4 in the paper is meaningless. Following the generalized smoothness, this definition should be and so on. Clearly, Definition 3.4 in the paper is meaningless. I think that the authors use this meaningless definition to simplify the convergence analysis in this paper.
Thus, I keep my score.
Thanks for your constructive comments. We have addressed the major concerns about the paper as outlined below.
Q1: From Algorithm 1 of this paper, the proposed algorithm basically extends the existing method in [1] to minimax optimization. The novelty of the proposed algorithm is limited. Meanwhile, the convergence analysis mainly follows the existing results.
A1: Thank you for your valuable comment.
About the algorithmic design.
After sufficient Nesterov's Accelerated Gradient Descent steps are performed with the restart technique to ensure that the function value of decreases monotonically, the negative curvature finding procedure, which is inspired by the idea of escaping saddle points, is performed; this is the main novelty of the proposed algorithm.
For the negative curvature finding procedure, we define a function in Eq. (80) such that , as shown in Appendix D, and estimate the derivative of with , which is attained in Line 15, and in the following epoch with iterations. Here, instead of the classical Nesterov's Accelerated Gradient Descent steps in the epochs with , the algorithm first obtains (represented as in Line 5) with and then updates in Line 19 by for every iteration in this epoch, where the algorithm tries to find a negative curvature direction around within a radius of . Then, after iterations, Lemma 4.7 implies that the unit vector is a negative curvature direction with high probability, and the value of the function decreases by the one-step descent along this direction in Line 22, again according to Lemma 4.7. Further details of the proof can be found in Appendix D.
About the convergence analysis.
Firstly, the main novelty compared with Luo et al. (2022) [2], Huang et al. (2022) [3] and Yang et al. (2023) [4] is that we propose a generalized Lipschitz Hessian assumption for the objective function of minimax optimization in Definition 3.4, which is much weaker than the classical Hessian Lipschitz assumption used in previous works. Under this assumption we further establish Lemma 4.2, a key technical lemma which proves that the primal function and are still well-defined under the relaxed smoothness condition on , and which analyses the structure and the generalized smoothness properties of and in terms of the generalized smoothness constants of and . This is significant because, compared with Huang et al. (2022) [3] and Yang et al. (2023) [4], the analysis of the error term of the hypergradient estimator becomes much more difficult under generalized smoothness: its upper bound depends on . This quantity is difficult to handle because can be large, and it is difficult to decouple the two measurable terms and . Leveraging these important properties, we give the analysis of the descent procedure in Lemmas 4.5 and 4.6, which yields that the algorithm can find first-order stationary points with sufficiently small once the condition on Line 11 is triggered.
Then, in Lemmas 4.7 and 4.8, we introduce two important lemmas for escaping saddle points via the negative curvature direction finding technique, which guarantee that the algorithm can find second-order stationary points with high probability under the generalized Hessian continuity assumption in . Compared with the results in Yang et al. (2023) [4], this improves the complexity by even under the much weaker assumptions.
About Definition 3.4.
In Assumption 3 of Xie et al. (2024) [1], the authors give a second-order theory of generalized smoothness for nonconvex optimization, that is,
where F is twice-differentiable with some constants and , interpreted from the perspective of the boundedness of higher-order derivatives. So for minimax optimization we give a definition of generalized Hessian smoothness in Definition 3.4. This is also reasonable in view of the definition of -smoothness in Zhang et al. (2019) [2], which is commonly accepted as the global generalized smoothness assumption in nonconvex optimization and which reduces to the classical L-smoothness when and .
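As we recall it from Zhang et al. (2019) [2], this condition takes the following form, with the classical $L$-smoothness recovered when $L_1 = 0$ and $L_0 = L$:

```latex
\[
  \bigl\|\nabla^2 F(\mathbf{x})\bigr\| \;\le\; L_0 + L_1 \bigl\|\nabla F(\mathbf{x})\bigr\|
  \qquad \text{for all } \mathbf{x}.
\]
% Taking L_1 = 0 and L_0 = L recovers the standard bound \|\nabla^2 F(x)\| \le L.
```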
It is therefore reasonable to state the generalized Hessian smoothness for minimax optimization with respect to and with constants , , , and , , , as in Definition 3.4 of our paper, rather than with , and as you mentioned above.
About the experiment.
We have conducted three numerical experiments to compare the performance of our algorithm with PRAHGD in terms of the efficiency of escaping saddle points on . The table below gives the number of iterations each algorithm needs so that 90% of the samples achieve a descent of in the value of , for , and , respectively. All three functions have a saddle point at . Each experiment is conducted with the same and for 300 samples generated in a Gaussian ball around . We will add these experiments to the final version of the paper.
| Method | (Iterations) | (Iterations) | (Iterations) |
|---|---|---|---|
| PRAHGD | 45 | 72 | 24 |
| ANCGDA (Ours) | 19 | 22 | 6 |
We also believe that the main novelty of our work is that we are the first to propose a second-order generalized smoothness theory for minimax optimization, while being the first to guarantee convergence to second-order stationary points in the nonconvex-strongly-concave setting under such relaxed smoothness assumptions, with our new algorithm achieving a complexity that is even better than the SOTA results in the classical Lipschitz-smooth setting. This is quite significant considering that all second-order stationary points of are (approximate) local minimax points of for nonconvex-strongly-concave minimax optimization. Relaxing the smoothness assumptions extends the second-order convergence guarantees to more general cases of nonconvex minimax optimization and is more consistent with a class of well-studied nonconvex machine learning problems.
Reference
[1] Xie C, Li C, Zhang C, et al. Trust Region Methods for Nonconvex Stochastic Optimization beyond Lipschitz Smoothness[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2024, 38(14): 16049-16057.
[2] Zhang J, He T, Sra S, et al. Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity[C]//International Conference on Learning Representations. 2019.
The paper proposes a new class of nonconvex-strongly-concave minimax problems with generalized first- and second-order smoothness. It then provides a new algorithm and analysis that achieve a new state-of-the-art iteration complexity under this assumption. The paper also provides neural network experiments, but their rigor seems limited (e.g., no error bars). The authors are encouraged to improve the clarity of the writing so that the novelty of their techniques and the relevance of generalized second-order smoothness are better understood.
Additional Comments from Reviewer Discussion
The authors were able to address most reviewers' concerns except the relevance of the second-order generalized smoothness assumption. The authors argued that this assumption is justified by its use in a recent work (Xie et al., 2024) on minimization. However, unlike that prior work, the current manuscript fails to provide any problems where this assumption holds. Even for the additional synthetic experiments provided during the discussion, the authors did not show that the assumption holds without the standard second-order smoothness assumption also holding.
Reject