6.5

/10

Poster4 位审稿人

最低6最高8标准差0.9

3.5

置信度

正确性3.0

贡献度2.5

表达3.5

NeurIPS 2024

An Accelerated Gradient Method for Convex Smooth Simple Bilevel Optimization

Jincheng Cao,Ruichen Jiang,Erfan Yazdandoost Hamedani,Aryan Mokhtari

OpenReview PDF

提交: 2024-05-14更新: 2024-11-06

摘要

In this paper, we focus on simple bilevel optimization problems, where we minimize a convex smooth objective function over the optimal solution set of another convex smooth constrained optimization problem. We present a novel bilevel optimization method that locally approximates the solution set of the lower-level problem using a cutting plane approach and employs an accelerated gradient-based update to reduce the upper-level objective function over the approximated solution set. We measure the performance of our method in terms of suboptimality and infeasibility errors and provide non-asymptotic convergence guarantees for both error criteria. Specifically, when the feasible set is compact, we show that our method requires at most $\mathcal{O}(\max\\{1/\sqrt{\epsilon_{f}}, 1/\epsilon_g\\})$ iterations to find a solution that is $\epsilon_f$-suboptimal and $\epsilon_g$-infeasible. Moreover, under the additional assumption that the lower-level objective satisfies the $r$-th Hölderian error bound, we show that our method achieves an iteration complexity of $\mathcal{O}(\max\\{\epsilon_{f}^{-\frac{2r-1}{2r}},\epsilon_{g}^{-\frac{2r-1}{2r}}\\})$, which matches the optimal complexity of single-level convex constrained optimization when $r=1$.

关键词

accelerated gradient methodsbilevel optimization

评审与讨论

审稿意见

评分: 6置信度: 32024-06-25

This work proposes a novel optimization algorithm with improved iteration complexity for convex smooth simple bilevel optimization, and demonstrates its faster convergence using experiments.

优点

The presentation is very clear. The literature review looks comprehensive. The algorithm and complexity results look reasonable.

缺点

This work focuses on simple bilevel optimization with both levels being convex and under deterministic setting (with access to full gradients instead of stochastic gradients), so the scope is not wide. Also, as shown in Table 1, the complexity results outperform existing ones only a little since the complexity order is the same as [8] for $r=1$ , and the complexity dependence on $\epsilon_g^{-\frac{2r-1}{2r}}$ is worse than $\epsilon_g^{-0.5}$ in [16].

问题

(1) In line 27, you may write down the math formulation for ``more general settings with parameterized lower-level problems'', or refer this formulation to appendix.

(2) How did you obtain Eq. (5), the condition about $g_k$ ? How to guarantee Eq. (5) in implementation with unknown $g^*$ , $x^*$ , $L_g$ ?

(3) Does Lemma 4.3 require Assumption 4.1? If yes, add Assumption 4.1 to Lemma 4.3.

局限性

In the checklist, the authors mention their limitation of compact domain assumption. There is no societal impact of this theoretical work.

作者回复

2024-08-07

We thank the reviewer for the comments and great questions!

W. This work focuses on simple bilevel optimization with both levels being convex and under deterministic setting (with access to full gradients instead of stochastic gradients), so the scope is not wide. Also, as shown in Table 1, the complexity results outperform existing ones only a little since the complexity order is the same as [Samadi et al. (2023)] for $r = 1$ , and the complexity dependence on $\epsilon_{g}^{-\frac{2r-1}{2r}}$ is worse than $\epsilon_{g}^{-0.5}$ in [Chen et al. (2024)].

R. We would like to mention that most of the prior works focus only on either deterministic setting or stochastic setting, as the algorithms designed for one of the settings are not easy to be extended to the other setting. One of the works in stochastic simple bilevel optimization is [Cao et al. (2023)]. The strategy in [Cao et al. (2023)] is designed to manage the errors between the actual gradients and their estimates, as well as the errors between the true function values and their estimates. This methodology could be adapted for our AGM-BiO algorithm. However, we are uncertain if such an extension would lead to an acceleration. Another challenge lies in selecting appropriate gradient and function value estimates, crucially impacting the sample complexities based on the errors between the actual gradients and their estimates. Therefore, adapting our algorithm to the stochastic setting requires a careful choice of these estimates and a comprehensive convergence analysis. Indeed, this presents a compelling direction for future research.

Under the Hölderian error bound assumption, our result is the same as the result in [Samadi et al. (2023)] when $r = 1$ . However, [Samadi et al. (2023)] did not provide any results for $r > 1$ . Note that [Chen et al. (2024)] is a concurrent work. The convergence rate in [Chen et al. (2024)] is $\mathcal{O}(\epsilon_{f}^{-0.5r}) + \mathcal{O}(\epsilon_g^{-0.5})$ , while our convergence rate is $\mathcal{O}(\max(\epsilon_{f}^{-\frac{2r-1}{2r}},\epsilon_{g}^{-\frac{2r-1}{2r}}))$ . If $\epsilon_f = \epsilon_g$ , our rate is better than theirs. Only when $\epsilon_g < \epsilon_f^{r/2}$ , their result is better than ours. Furthermore, their results require the Lipschitz continuity of the upper-level objective (Assumption 2.1.1 in [Chen et al. (2024)]) and the compactness of the domain (a hidden assumption of Lemma 3.2 in [Chen et al. (2024)]), whereas our results under the Hölderian error bound assumption do not require either of these conditions.

We will add a remark about the above discussion to the revised paper. Thanks for your feedback.

Q1. In line 27, you may write down the math formulation for 'more general settings with parameterized lower-level problems', or refer this formulation to appendix.

A1. Thanks for pointing this out. We will add a formulation for the general bilevel problem in our revised vision.

Q2. How did you obtain Eq. (5), the condition about $g_k$ ? How to guarantee Eq. (5) in implementation with unknown $g^{\*}$ , $x^{\*}$ , and $L_g$ ?

A2. As we mentioned in line 151, we can generate this sequence $\{ g_k \}$ independently from the main algorithm by applying an accelerated projected gradient method to the lower-level problem $\min_{z\in \mathcal{Z}}~g(z)$ . The Eq. (5) is the convergence result of the accelerated projected gradient method for solving the single-level problem [Nesterov (2018)]. To perform the single-level accelerated gradient method on $g$ , we do not need to know the $g^\*$ and $x^\*$ . Note that $g$ has to be convex and $L_g$ -smooth to guarantee the Eq. (5) holds which are also the assumptions of our method. Hence, given our assumptions, we are able to generate a sequence $g_k$ satisfying the Eq. (5). We will add a remark about this point and provide necessary reference for the convergence rate.

Q3. Does Lemma 4.3 require Assumption 4.1? If yes, add Assumption 4.1 to Lemma 4.3?

A3. No. Lemma 4.3 does not require the Assumption 4.1. Assumption 4.1 is only required for Theorems 4.4 and 4.5.

References:

Samadi, S., Burbano, D. and Yousefian, F., 2023. Achieving optimal complexity guarantees for a class of bilevel convex optimization problems.
Chen, P., Shi, X., Jiang, R., and Wang, J., 2024. Penalty-based methods for simple bilevel optimization under holderian error bounds.
Nesterov, Y., 2018. Lectures on convex optimization.

评论- Reviewer fv8w's further query on Q2

2024-08-09

Hello, authors.

For Q2, does $g_k$ denote the function value from the k-th iteration of the Nesterov's gradient method on $g$ ?

{ $g_k$ } is input of Algorithms 1 and 2. Could you remove that input and give the procedure of obtaining $g_k$ in the algorithm body part?

Reviewer fv8w

2024-08-10

Thank you for the follow-up questions!

Q. For Q2, does $g_k$ denote the function value from the k-th iteration of the Nesterov's gradient method on $g$ ? $\\{g_k\\}$ is input of Algorithms 1 and 2. Could you remove that input and give the procedure of obtaining $g_k$ in the algorithm body part?

A. The reviewer is correct: the function value of the $k$ -th iteration of the Nesterov accelerated gradient method on $g$ is denoted by $g_k$ . However, this is not the only possible choice. In fact, any $\\{g_k\\}$ satisfying Eq. (5) can also be used as an input of our algorithms.

As you mentioned, this input can be removed. Instead, we could obtain the $g_k$ at the beginning of the $k$ -th iteration in our algorithm body part and use it to construct our approximated feasible set $\mathcal{X}_{k}$ . We will add an instantiation of our algorithm with Nesterov accelerated gradient iterates to the paper, which will be included in the appendix.

To clarify, there are multiple choices for the sequence $g_k$ . For instance, one could consider a constant sequence where $g_k$ is set to the function value of the last iterate of AGD for all $k$ . This choice has the benefit of shaving a logarithmic factor from the convergence rate, as discussed in Remark 4.2.

评论- The answer to Q2 is clear now.

2024-08-10

Thanks authors.

The answer to Q2 is clear now.

Reviewer fv8w

审稿意见

评分: 6置信度: 32024-07-01

The paper works on the problem of simple convex smooth bilevel optimisation, where ''simple'' means single-variable. The paper achieves the optimal rate for this problem by a combination of Nesterov's acceleration and Jiang-Abolfazli-Mokhtari-Hamedani's cutting-plane method.

优点

STRENGTHS.

The paper achieves an optimal rate for an important problem class.
The paper is very well-written.

缺点

问题

QUESTIONS.

What limitations do the authors anticipate in extending this technique (acceleration + cutting-plane) to the non-simple case (i.e., when you have an additional variable $y$ in the upper level objective, defined as the optimizer of a parametrized lower-level problem).

局限性

N/A since it's a theory paper

作者回复

2024-08-07

Thank the reviewer for the insightful question!

Q1. What limitations do the authors anticipate in extending this technique (acceleration + cutting-plane) to the non-simple case (i.e., when you have an additional variable in the upper level objective, defined as the optimizer of a parametrized lower-level problem).

A1. First of all, most of the work on general bilevel problems requires strong convexity for the lower-level function. In the case of the simple bilevel problem, this assumption would make the feasible set a singleton, rendering the problem trivial since it would amount to solving the lower-level problem only.

The general bilevel problem, when featuring a convex lower-level problem, is widely recognized as challenging in its most generic form. In this case, the optimal solution set of the lower-level problem will change as the upper-level variable changes with each iteration. Specifically, following a similar construction of the half-space, we cannot ensure that our constructed set always contains the optimal solution set of the bilevel problem, which is a key property in our proof. Therefore, to enhance computational tractability and adapt our framework to address the general bilevel problem effectively, we may need to introduce some additional assumptions, which is a compelling direction for future research.

评论- Thank you for your response

2024-08-07

Dear authors,

Thank you for your clear explanation! I acknowledge reading your response and am happy to maintain my score and advocate for acceptance of your paper.

Best wishes!

2024-08-10

Thank you for your response and advocating for the acceptance of our paper!

审稿意见

评分: 8置信度: 42024-07-10

The paper introduces a new algorithm called AGM-BiO (Accelerated Gradient Method for Bilevel Optimization) for solving simple bilevel optimization problems where both the upper and lower level objectives are convex and smooth.

优点

The paper is well-written and easy to follow. The assumptions are clearly stated. The dependence of the convergence rates on everything seems to be explicitly written out.
The proposed algorithm seems to be easy to implement, and it achieves the best-known complexity bounds for both suboptimality and infeasibility in the considered settings.
Experiments are conducted to validate the strength of the proposed algorithm.

缺点

You provide a modified algorithm (Algorithm 2) that could potentially handle this composite structure. In line 629-line 631 you mention that you can derive identical complexity results for Algorithm 2 in either the compact domain setting or with the Hölderian error bounds on g. I guess formally stating the convergence result in a theorem would strengthen the paper and provide a clear reference point for readers interested in the composite case.

问题

局限性

Yes.

作者回复

2024-08-07

We thank the reviewer for the comment and the positive feedback!

W. You provide a modified algorithm (Algorithm 2) that could potentially handle this composite structure. In line 629-line 631 you mention that you can derive identical complexity results for Algorithm 2 in either the compact domain setting or with the Hölderian error bounds on g. I guess formally stating the convergence result in a theorem would strengthen the paper and provide a clear reference point for readers interested in the composite case.

R. As we mentioned in the Appendix B, Algorithm 2 designed for proximal setting can achieve the same results as Algorithm 1 in either the compact domain setting or with the Hölderian error bounds on g. The major difference lies in the main Lemma A.1. Therefore, we provided a detailed proof of a new Lemma B.1 for the composite setting. By replacing Lemma A.1 with Lemma B.1, all the proofs in our paper will hold. As the reviewer suggests, we will add the formal theorems for the composite setting in the revised version.

审稿意见

评分: 6置信度: 42024-07-12

This work proposes an accelerated method to solve convex simple bilevel problems, with the author providing both theoretical and numerical guarantees of the algorithm's convergence.

优点

This work makes a theoretical contribution to the convergence analysis, and the algorithm demonstrates an advanced convergence rate under the Hölderian error bound.
The analysis are detailed and concrete. The whole work is easy to follow.

缺点

Since the main techniques in this work, such as the accelerated gradient method and the cutting plane approach, have already been studied, the contribution of the proposed methods is incremental, and the convergence rate may not be difficult to prove based on existing studies.

问题

What is the convergence rate without using the accelerated gradient method, relying only on the projection onto the cutting plane?
Since the accelerated gradient method has been well studied recently, what is the unique challenge of applying it to simple bilevel problems?
Will this algorithm demonstrate any advantage in convergence without the assumption of the Hölderian error bound?
How do you find a feasible $X_k$ in practice? Is an additional loop required to achieve this?

局限性

No Limitation

作者回复

2024-08-07

We thank the reviewer for the feedback and valuable questions!

R. To address your concern in the weakness section, please refer to our answers to your first two questions. As stated in A1, the non-accelerated version of the projection-based algorithm yields unsatisfactory results. Thus, the accelerated gradient method is essential and significantly improves the convergence rates. In A2, we emphasize that constructing the correct approximated set is not as straightforward as in [6]. It requires a careful selection of the norm vector of the half-space, which can be interpreted as the descent direction of the lower-level objective $g$ .

Furthermore, the proof of the convergence results differs from that for the single-level counterpart. Specifically, in the proof of the main descent Lemma A.1, we employed different potential functions for the upper- and lower-level objectives, carefully selected the coefficients $A\_k$ and $a_k$ , and utilized the properties of the constructed approximated set several times. In Lemma 4.3, we characterized the convergence rate by considering a weighted sum of the upper- and lower-level functions. We then presented our formal theorems under the Hölderian error bound by selecting the appropriate weight. However, this weight does not appear explicitly in the algorithm, as it can be controlled by the step size.

A1. That is a great question. Our initial idea was to develop a projection-based algorithm without using the acceleration technique to solve simple bilevel problems. However, the non-accelerated version of the algorithm did not produce satisfactory results. Specifically, it achieves $\mathcal{O}(1/K)$ for the upper-level objective but fails to provide any convergence guarantee for the lower-level objective. On a related note, we would like to mention that a recent work [Devanathan and Boyd (2024)] proposed the Alternating-update Polyak Minorant Method (PMM), which is similar to the non-accelerated simple bilevel algorithm relying only on the cutting plane. It involves projecting onto the approximated sublevel sets of the upper- and lower-level objectives alternatively, assuming $f^{\*}$ is known. However, they only provide asymptotic convergence without any rate guarantees under common assumptions. To some extent, this also demonstrates the challenge of solving simple bilevel problems and the complexity of integrating the cutting plane technique with projection-based methods.

A2. A naive implementation of AGM for simple bilevel optimization requires access to the lower-level solution set $\mathcal{X}_g^\*$ , which is typically not available in closed form. Therefore, one must either use regularization by mixing the two objective functions or approximate the lower-level solution set.

The former approach, as studied by [8], involves determining a proper regularization parameter, which is challenging in practice and only provides a $\mathcal{O}(1/K)$ convergence guarantee on both levels without further assumptions even with the accelerated gradient method.

For the latter approach, the design of the approximated lower-level solution set $\mathcal{X}_k$ is a nuanced task. If we project onto a set much larger than $\mathcal{X}_g^{\*}$ , then the iterate may deviate from the optimal solution set, leading to a large error for the lower-level objective.

Furthermore, as we mentioned in Remark 3.1, there are three intertwining sequences $\\{x_k\\}$ , $\\{y_k\\}$ , and $\\{z_k\\}$ in AGM, and depending on where we linearize the objective function, various alternative formulations of halfspaces could contain the optimal solution set of the lower-level problem $\mathcal{X}_{g}^{\*}$ , such as $\\{z \in \mathcal{Z}: g(x_k) + \langle\nabla g(x_k), z-x_k\rangle \leq g_k\\}$ and $\\{z \in \mathcal{Z}: g(z_k) + \langle\nabla g(z_k), z-z_k\rangle \leq g_k\\}$ . However, our choice of using the gradient at $y_k$ to construct the halfspace is not arbitrary but essential for showing convergence guarantees for the lower-level objective.

For single-level optimization problems, we can project onto the feasible set in each iteration to keep all the iterates feasible. However, for simple bilevel optimization, the feasible set $\mathcal{X}_g^{\*}$ may not have an explicit form. By projecting onto the approximated set $\mathcal{X}_k$ , the iterates could be infeasible, introducing additional error in the lower-level objective convergence analysis. As a result, the condition that $f(x_k) \geq f^{\*}$ , which always holds in single-level problems, can be violated in bilevel problems. This also leads to challenges in our convergence analysis, and we cannot achieve the desired results for the lower-level objective unless $f(x_K) \geq f^{\*}$ .

A3. As we showed in Table 1, without the Hölderian error bound assumption, our method achieves the complexity $\mathcal{O}(\max(\frac{1}{\epsilon_f^{0.5}}, \frac{1}{\epsilon_g}))$ , which is better than the others under similar settings.

A4. The constructed set $X_k$ has an explicit form, i.e., $X_k \triangleq \\{ z \in \mathcal{Z}: g(y_k)+\langle \nabla g(y_k),z-y_k \rangle \leq g_k\\}$ , which always contains the lower-level problem solution set $\mathcal{X}_g^*$ (the feasible set of the upper-level problem) as stated in line 154-158.

How to project onto the set $X_k$ is another question you may be interested in. In some cases, such as our over-parameterized regression problem, $X_k$ is the intersection of an $L_2$ ball and a half-space, for which a closed-form solution exists to find the projected iterates $z_k$ . In other cases, such as our linear inverse problem, we may not be able to find $z_k$ directly. Instead, we can solve the projection subproblem using Dykstra's projection algorithm [43], as mentioned in line 701. For this scenario, an additional loop is needed to solve the subproblem.

Reference:

Devanathan, N. and Boyd, S., 2024. Polyak Minorant Method for Convex Optimization

2024-08-13

Thanks for the authors' response. My concerns and questions are solved so I will raise my score to 6. I am still curious about why the non-accelerated algorithm fails to provide any convergence guarantee for the lower-level objective.

2024-08-13

Thank you for the response. We're glad your questions and concerns have been addressed!

Regarding your further query, specifically, the non-accelerated version of our algorithm follows the update rule $x_{k+1} = \Pi_{X_k}(x_k - \eta_k \nabla f(x_k))$ , where the set $X_k$ is similarly constructed from a cutting plane as $X_k = \\\{z \in Z: g(x_k) + \langle \nabla g(x_k), z-x_k \rangle \leq g_k \\\}$ . To upper bound the lower-level objective, we can apply a similar analysis as in our Lemma A.1, leading to: $g(x_{k+1}) \leq g(x_k) + \langle \nabla g(x_k), x_{k+1} - x_k\rangle+ \frac{L_g}{2}\\|x_{k+1} - x_k\\|^2 \leq g_k + \frac{L_g}{2}\\|x_{k+1} - x_k\\|^2,$ where the first inequality used the $L_g$ -smoothness of $g$ , and the second inequality follows from the fact that $x_{k+1} \in X_k$ . However, the main challenge is controlling $\\|x_{k+1} - x_k\\|^2$ (see also Remark 4.1 for a related issue in our accelerated methods). If we follow the same strategy as in Theorem 4.1 and upper bound $\\|x_{k+1} - x_k\\|^2 \leq D^2$ by the compactness of $Z$ , this results in $g(x_{k+1}) \leq g_k + \frac{L_g}{2}D^2$ , which fails to provide a convergence rate for the lower-level objective. Therefore, it appears that acceleration is crucial to achieve the $O(1/k)$ rate for the lower-level objective reported in Theorem 4.1.

That said, we should clarify that this does not rule out the possibility of proving a convergence rate for the lower-level objective under the standard assumptions, though this appears non-trivial and would require a different analysis from the one presented in our paper. Thank you for your question, and we will further explore this topic in future research.

最终决定Accept (poster)

2024-09-25

This paper tackles the bilevel optimization problem where the goal is to minimize a convex smooth objective function over the optimal solutions of another convex smooth constrained problem. The authors introduce a novel method that approximates the solution set of the lower-level problem using a cutting plane approach and applies an accelerated gradient-based update to minimize the upper-level objective. They offer non-asymptotic convergence guarantees for both suboptimality and infeasibility errors, and under the r-th Holderian error bound assumption, they achieve an enhanced convergence bound. Given that all reviewers endorsed acceptance, this paper should be accepted. However, a concern is the difficulty in solving the subproblem (line 6 of Algorithm 1), as noted by Reviewer R6Rr. I suggest the authors provide a detailed discussion on this in the main part of the paper rather than in the appendix.