Escaping Saddle Point Efficiently in Minimax and Bilevel Optimizations
摘要
评审与讨论
The authors propose a perturbed stochastic gradient method for bi-level and minimax optimization. Crucially, the gradient complexity of their proposed methods (suppressing the condition number dependence) achieve \tilde{O}(\epsilon^{-3}) gradient complexity in order to find a second order critical point. This seems to match the best gradient complexity known among stochastic methods converging just to a critical point.
优点
The theoretical result is quite strong, polynomially improving upon the gradient complexity of the best known result (improving from \epsilon^{-4} to \epsilon^{-3}). I appreciate the inclusion of condition number dependence in the results as well. The example chosen for the numerical result is also quite illustrative and well chosen.
缺点
The main weakness is the presentation of the paper. I provide a few (small) comments here.
Minor comments:
-
The paper makes strong smoothness assumptions (which do seem to be standard in the literature), but it would be useful to include references to methods which do not require such strong assumptions, such as Chen, et. al (https://arxiv.org/pdf/2306.12067.pdf).
-
There are several small issues riddled throughout that should be resolved before publication. To give one such example (among several), Assumption 2 is not quite precise: There should be a quantifier on \xi, \zeta (e.g. I suppose this should hold for almost every \xi and \zeta).
-
The authors never define HV and JV as stated in Theorem 2. I assume these are the number of required Hessian and Jacobian vector products.
-
The description of the algorithm is fairly difficult to follow. I would recommend moving the second empirical result on hyper-representation learning to the appendix and perhaps using the extra space to more clearly explain the algorithm.
问题
-
This question is a bit broader than the scope of the paper, but answering it would help quite a bit in terms of clarity. The authors mention a lower bound of Zhang et. al (2021) which is achieved for deterministic algorithms by Lin, et. al (2020b). Could the authors clarify the situation on the lower bound in the specific setting they consider? E.g. with the smoothness assumptions imposed by Assumptions 2, 3 and strong convexity Assumption 1, are there known results for lower bounds on reaching a second order stationary point as considered here? It is interesting to improve upper bounds as done in this paper, but some guidance on lower bounds would either (i.) situate and clarify the results quite a bit if known or (ii.) strengthen the results significantly if not known.
-
Regarding the numerical experiments, the trajectory of the proposed algorithm exhibits some interesting behavior which would be nice to clarify. In particular, should the reader interpret the flat regions (e.g. Figure 1.(a) iterations 10^{4} – 2\cdot 10^{4}) as while the algorithm is trying to escape from a bad critical point? Moreover, there seem to be distinctions in the convergence behavior in the different phases. It is a bit hard to tell, but the proposed method seems to enjoy linear convergence to the first critical point and thereafter sublinear convergence behavior (with different rates of convergence, for instance at iterate 4*10^{4} in Figure 1.(a)). Could the authors clarify this a bit?
We appreciate the recognition of the contribution of our paper. We will reply to your questions as follows.
- Thank you for the suggestion of studying the lower bound for second-order stationary point. We can reach some conclusions based existing works. Since the hard instance constructed by Zhang et al (2021) automatically satisfies Assumption 3 and second-order stationary point is a special case of first-order stationary, the lower bound of second-order stationary point should be no smaller than first-order stationary point. In addition, the complexity of perturbed gradient method matches the result of searching first-order stationary point if ignoring logarithmic term, which indicates our method is near-optimal to achieve second-order stationary point.
- Yes, we think the flat region is probably the stage of escaping from a bad critical point. When is far away from a first-order stationary point, the algorithm will stay in the descent phase and the convergence is fast in this stage. When is approaching a first-order stationary point, perturbations and escaping phases will occur and sometimes they could be slower than the original descent phase, which can be a possible reason for the slow convergence after gradient oracles.
The paper proposes a new algorithm, PRGDA, that combines the ideas of LENA, a first-order algorithm for escaping saddle points, and SREDA, a variance reduction method for nonconvex-strongly-concave (NC-SC) minimax optimization. The authors provide convergence guarantees for the proposed algorithm. For stochastic NC-SC minimax optimization, this is the first first-order algorithm to achieve second-order convergence, and it requires gradient complexity to find an second-order stationary point. For stochastic NC-SC bilevel optimization, it achieves and gradient complexities for the upper and lower level functions, respectively. Further experiments are conducted to show the ability of the algorithm to find local minima instead of saddle points.
优点
-
The proposed PRGDA is the first first-order stochastic algorithm for NC-SC minimax optimization with a second-order convergence guarantee, and its complexity matches the best result for finding a first-order stationary point.
-
For bilevel optimization, the new method improves upon the complexity of existing methods.
缺点
-
I suggest that the author motivate and discuss in the paper why, in minimax optimization, we aim to find the local minimum of the primal function in the first place. In minimization problems, this is natural. However, in games, we care about equilibria. For instance, [1] discussed the significance of the local minimax point in this area, while [2] mentioned that a second-order stationary condition implies a local minimax point (Fact 1). Nevertheless, the relationship between saddle points of the primal function and the local minimax point remains somewhat unclear to me. Do they not intersect at all? Why should we escape these saddle points? What happens in bilevel optimization?
-
Although the work proposes the first first-order stochastic algorithm for NS-SC minimax optimization with second-order convergence and improves the complexity for bilevel problems, the techniques seem similar to existing methods, namely, LENA [3] and SREDA. Could the author elucidate the novelty in the algorithm design or proof techniques?
-
Regarding the experiments, how did the author choose the hyper-parameters? Were these hyper-parameters optimized for each algorithm? While the sensitivity of StocBio + iNEON to hyper-parameters is discussed, I am curious about the fairness of the comparison.
-
Some claims appear unsound:
- Theorems 1 and 2 should also include assumptions regarding noise. This is only mentioned in the appendix during the proof of these theorems. Additionally, the noise assumption (Equation 13) is not "bounded variance" but should be termed as bounded noise or bounded noise support, which is stronger than bounded variance.
- "PRGDA is the first algorithm that is guaranteed to obtain second-order stationary point for stochastic nonconvex minimax optimization problems." However, newer versions of the cited paper (Chen et al. (2021b)) introduce a stochastic version. This should also be reflected in Table 1; for instance, Cubic-GDA should be marked in the "Stochastic" field.
- Some references appear to be inaccurate, for example, "including intuitive methods SGDmax (Jin et al. (2019))", and "SGDmax (Jin et al. (2019)) is an intuitive double loop algorithm".
-
Some notations either are not introduced or are only formally defined in later sections, such as:
- "" in the abstract.
- "" is introduced early on but is only defined in section 3.
- "SFO" is not defined. I assume it stands for "stochastic first-order oracle".
- "" and "" in Theorem 2 are not defined.
References
[1] Jin, Chi, Praneeth Netrapalli, and Michael Jordan. "What is local optimality in nonconvex-nonconcave minimax optimization?." International conference on machine learning. PMLR, 2020.
[2] Chen, Ziyi, Zhengyang Hu, Qunwei Li, Zhe Wang, and Yi Zhou. "A Cubic Regularization Approach for Finding Local Minimax Points in Nonconvex Minimax Optimization." Transactions on Machine Learning Research. 2020.
[3] Chen, Zixiang, Dongruo Zhou, and Quanquan Gu. "Faster perturbed stochastic gradient methods for finding local minima." International Conference on Algorithmic Learning Theory. PMLR, 2022.
问题
Could the author clarify the concerns in Weaknesses 1-3?
Thank you for the effort to review our work and we will address your concerns as follows.
- Thank you for the suggestion of discussing the relation between saddle point of and local minimax point. Here we will prove that under Assumption 1 and Assumption 2, a saddle point of is not a local minimax point. Suppose is a saddle point of , then for , there exists such that and . Let and . By Proposition 1 in Appendix F we have . As and are maximizers for and , we know and since . Therefore, we have according to the definition of , which implies that is not a local minimax point.
- We summarize some fundamental challenges as follows. Firstly, the estimation of is different to the original minimax optimization as we introduce the perturbation and adopt a larger stepsize in the escaping phase. It should be estimated in these new situations. Secondly, since in minimax problem also produces noise rather than just one variable , the noise of gradient estimator is larger than conventional single-level optimization. Meanwhile, the noise is also influenced by the parameters of the escaping phase (such as and ) as we have mentioned, which indicates that the selection of parameters is harder than the work of the LENA algorithm. Thirdly, the dependence of is not included in related works of Perturbed GD and its variant. But we need to analyze the relation between the parameters of escaping phase and and provide proper options.
- In StocBiO + iNEON, a subroutine iNEON is run to compute the negative curvature. In each inner loop iteration of this subroutine, the full batch loss function value is estimated approximately via the stochastic minibatch. In theoretical analysis the batchsize is , which makes the estimation precise. However, it is too expensive in practice. The precision of this estimation is crucial to StocBiO + iNEON, and hence it suffers poor performance if the batchsize used for this estimation is not large enough. In our experiment, all regular batchsize is fixed as 40. In PRGDA, a large batch of is loaded periodically for q = 25. The averaged batchsize is small, while StocBiO + iNEON needs the large batch in every iteration in the subroutine. In our experiment, the batchsize used for estimating the loss function value in iNEON is set to 40, in which condition StocBiO + iNEON fails to escape saddle point. We conduct some additional experiments to increase the batchsize to 1/10 full batch and StocBiO + iNEON still fails to escape saddle point. When we increase the batchsize to 1/4 full batch and other hyperparameters keep the same, StocBiO + iNEON escapes saddle point and achieves the ratio of distance of 0.42 after gradient oracles and 0.29 after gradient oracles when d = 100. The result is still slower than PRGDA in Figure 2 (c).
This paper proposes a stochastic first-order algorithm called PRGDA for nonconvex-strongly-concave minimax optimization. For bilevel optimization in particular, the authors prove convergence to a second-order stationary point with a gradient complexity of , which improves upon the previous best result in Huang et al. 2022, which achieved a complexity of .
优点
I think the main strength of this paper is a theoretical improvement of the gradient complexity, as I've summarized above.
缺点
While I think the theoretical result in this paper is interesting, there are a few reasons that prevent me from giving this paper a higher score.
- The overall presentation is not clear. Specifically, in sections 4 and 5 where algorithm 1 is introduced, the description is very dense and difficult to parse. The authors refer to quite a few previous algorithms such as SREDA, PiSARAH and SPIDER without actually giving a brief summary of what these algorithms do. Also missing from this section is a highlight of what makes PRGDA different from the previous SOTA method in Huang et al. 2022.
- Another major weak point in the presentation is a lack of clearer comparison with prior work. In tables 2 and 3, it is unclear to me if most of these results are actually comparable, since I'm not sure if they use all the assumptions 1-5 in this paper. In addition, the related work is scattered throughout the whole paper, and many prior algorithms are named, but not described at all. Obviously the authors do not need to describe all prior work in detail, but I think it is important to highlight what makes PRGDA from prior algorithms except from stochasticity and a simple perturbation.
- Section 6 is called convergence analysis, so i expected this section to include a discussion of the theoretical innovations of PRGDA that allows the authors to prove a better gradient complexity. However, there is no convergence analysis at all. Instead, only the two main theorems are stated, without any further explanation. This makes it hard to gauge how significant the theoretical guarantees are. For instance, in section 2.3 the authors claim that perturbed GD in the deterministic and stochastic settings are totally different. However, this is not the case at least in Jin et al. [1], where the proof for GD and SGD are quite similar. The analysis for GD and SGD might be more different in bilevel and minimax optimization, but I think it needs to be spelled out in more detail.
Overall I think the actual contributions of this paper are a bit hard to see because of the presentation. I suggest the authors spend more space to clarify difference with prior work and spell out the innovations in both your algorithm and proof technique.
[1] Jin, Chi, et al. "On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points." Journal of the ACM (JACM) 68.2 (2021): 1-29.
问题
Please see the weaknesses section.
Thank you for pointing out the problems of our presentation. We will improve our presentation and address your concerns as follows.
- In our final version, we will refer to the SREDA algorithm in the Appendix and provide more details to ensure that readers can compare our work with previous works more easily. The most significant difference between StoBiO + iNEON and our work is that StoBiO + iNEON applies a subroutine (iNEON) to compute the negative curvature when reaching a first-order stationary point, while our method adds a perturbation. A weakness of StoBiO + iNEON is that large batch of is required to estimate the loss function value in every iteration in the iNEON subroutine. In contrast, our method only requires an averaged batchsize of . Our experimental results show that StoBiO + iNEON cannot escape saddle point effectively with small batch, under which setting our method succeeds to find second-order stationary point.
- In our final version, we will specify the assumptions that is used for each method in Table 2 and 3. For example, in Table 3 Acc-MDA and SREDA require Assumption 1 and 2. SGDA requires Assumption 2 (more complexity) or Assumption 1 and 2 (less complexity). Our PRGDA (minimax) requires Assumption 1, 2 and 3. The assumption of Lipschitz continuous second-order derivative is common in previous works that analyze the second-order optimality, including PGD, SSRGD, Cubic-GDA, MCN and StocBiO + iNEON. Besides, without the escaping phase, algorithms such as SREDA still cannot reach second-order stationary point even Assumption 3 is hold (considering the example that GD is trapped at saddle points).
- We summarize some fundamental challenges as follows. Firstly, the estimation of is different to the original minimax optimization as we introduce the perturbation and adopt a larger stepsize in the escaping phase. It should be estimated in these new situations. Secondly, since in minimax problem also produces noise rather than just one variable , the noise of gradient estimator is larger than conventional single-level optimization. Meanwhile, the noise is also influenced by the parameters of the escaping phase (such as and ) as we have mentioned, which indicates that the selection of parameters is harder than the work of Pullback algorithm. Thirdly, the dependence of is not included in related works of Perturbed GD and its variant. But we need to analyze the relation between the parameters of escaping phase and and provide proper options.
The submission introduces PRGDA, a novel algorithm designed for nonconvex-strongly-concave minimax and nonconvex-strongly-convex bilevel optimization problems. PRGDA stands out by avoiding the computation of second-order derivatives. The authors claim this algorithm achieves state-of-the-art gradient complexity for second-order stationary points in stochastic nonconvex minimax problems and demonstrates improved performance in stochastic nonconvex-strongly-convex bilevel optimization.
The paper received mixed reviews, with general appreciation for the theoretical advancements but concerns regarding the presentation, experimental setup, and novelty of the proposed method. The theoretical foundation of PRGDA, particularly its efficiency in gradient complexity, is acknowledged as a strong point. Reviewers recognize the significant theoretical improvement over existing methods. However, concerns were raised about the assumptions made in the theorems and a lack of detailed convergence analysis. Reviewer tUHe pointed out that the paper's claims of novelty might be overstated, especially in light of similarity to existing methods like LENA and SREDA.
The paper's presentation was critiqued by multiple reviewers. Reviewer JNob found the algorithm's description dense and difficult to follow. The comparison with prior work was found lacking, with a need for a clearer exposition of what sets PRGDA apart from existing methods.
The paper's experimental section received criticism for its setup and the selection of hyperparameters. Reviewer tUHe questioned the fairness in the comparison of PRGDA with other algorithms, particularly regarding the choice and optimization of hyperparameters. The results, while demonstrating PRGDA's efficiency, were not convincing enough due to the aforementioned issues in experimental design and comparison.
The paper presents notable theoretical advancements in escaping saddle points in minimax and bilevel optimizations. However, its presentation, experimental evaluation, and clarity in distinguishing its contributions from existing works need significant improvement. Given the mixed reviews and identified weaknesses, the paper should undergo substantial revisions, particularly in addressing the concerns raised in the clarity of presentation and experimental setup. The theoretical strengths of the paper are promising, but these need to be better contextualized and supported by a more robust experimental framework and a clearer exposition.
为何不给更高分
Weaknesses outweigh the strengths
为何不给更低分
N/A
Reject