An Accelerated Algorithm for Stochastic Bilevel Optimization under Unbounded Smoothness
This paper introduces a new algorithm and analysis to accelerate bilevel optimization under unbounded smoothness.
Abstract
Reviews and Discussion
This work develops the first algorithm that achieves near-optimal oracle complexity (the number of evaluations of gradients or Hessian/Jacobian-vector products) for reaching an $\epsilon$-stationary point of a stochastic bilevel optimization problem with an unbounded-smooth nonconvex upper-level function and a strongly convex lower-level function. The algorithm design is novel: it updates the upper-level variable by normalized stochastic gradient descent with recursive momentum and updates the lower-level variable by the stochastic Nesterov accelerated gradient (SNAG) method.
Strengths
This work studies a specific stochastic bilevel optimization problem with significant applications in sequential data learning. The upper-level function has unbounded smoothness, which is more general than the widely used Lipschitz smoothness assumption. The oracle complexity is improved on this specific bilevel optimization problem compared with existing works. The algorithm design not only combines but also adjusts STORM variance reduction and Nesterov acceleration. The two real-world experiments look convincing. The presentation is clean and clear.
Weaknesses
The major weakness is Assumption 4.2, as mentioned in question (1) below. Some points could be clarified, as mentioned in the other questions below.
Questions
(1) My major concern is that Assumption 4.2 is made on algorithm-generated variables, not on the problem setting, and therefore cannot be verified. Is it possible to bound in terms of to get rid of Assumption 4.2? If not, Assumption 4.2 should at least be mentioned in the main result (Theorem 4.1). I would like to raise my rating if all the assumptions become problem-related.
(2) In line 48, "each realization of the stochastic oracle calls is Lipschitz with respect to its argument", do you mean and are Lipschitz continuous functions of for each ? You might express this in a formula to make it clearer.
(3) You could also cite [1] as a paper on relaxed smoothness:
[1] Convex and Non-convex Optimization Under Generalized Smoothness
(4) In line 127, do you mean "... is a good approximation of if "?
(5) Is there an intuition for why line 21 (averaging of y) is needed in Algorithm 2? Is it possible to implement steps of SNAG such that and remove the averaging of y for every outer iteration t, and obtain a convergence result purely in expectation instead of with high probability? This yields the same oracle complexity per round since are of the same order, and may get rid of Assumption 4.2 by bounding in terms of .
(6) You said your upper-level algorithm design is inspired by [17, 49]. However, in line 23 (STORM variance reduction) of Algorithm 2, your second and third additive terms use the same samples, while [49] uses different samples. The algorithm design differs even more from [17]. Such a difference still exists even if Algorithm 2 is reduced to the single-level setting. Why do you use such a different design?
(7) Also in line 23, does denote the average of the expression right below Eq. (2) over samples ? You could define it more clearly.
(8) At the beginning of the Section 5 experiments, appears in your AUC maximization. Do you mean ? If not, what do and mean?
Limitations
The conclusion section mentions the limitation that the convergence analysis for Option I of Algorithm 2 relies on the lower-level problem being a one-dimensional quadratic function. I actually think this is fine, since Option II works for more general strongly convex lower-level functions. The major limitation from my perspective is mentioned in my question (1) above about the undesired Assumption 4.2. The authors said in the checklist that they do not see any negative societal impacts in their theoretical work, with which I agree.
Thank you for taking the time to review our paper. We have addressed each of your concerns below.
A1. Assumption 4.2 used in Section 4.3.1 is indeed problem-related and can be verified when applied to the bilevel optimization setting. Note that the main goal of Section 4.3.1 is to provide a general framework for proving the convergence of SNAG under distributional drift. This framework can be leveraged as a tool to control the lower-level error in bilevel optimization and derive Lemma 4.7. In other words, we can view the lower-level problem of bilevel optimization as a special case within the more general framework provided in Section 4.3.1. We have mentioned the connection between Section 4.3.1 and Section 4.3.2 at lines 232-239 and lines 270-273.
Now we verify that Assumption 4.2 holds in the bilevel optimization setting. We can regard the change of the upper-level variable at each iteration as the distributional drift for the lower-level problem. Thus, using the notation in Section 4.3.1, we can set the objective sequences as , the minimizers at time and as and , the minimizer drift at time as , and the stochastic gradient as . Moreover,
- By Assumption 3.2, is -strongly convex and -smooth in .
- By Assumption 3.3, the noise is norm sub-Gaussian conditioned on with parameter .
- By Lemma B.1 in Appendix B and line 24 of Algorithm 2 (i.e., the update for ), is -Lipschitz continuous, and . Then for all , we have . This implies that the minimizer drift is almost surely bounded by . Hence, is also sub-exponential conditioned on with parameter .
Thus the lower-level problem of the bilevel setting satisfies Assumption 4.2 with , , , and . Then we can apply the general lemma (i.e., Lemma 4.3) provided in Section 4.3.1 to establish the lower-level error control stated in Section 4.3.2 (i.e., Lemmas 4.4-4.7). We will make this clearer in the revised version of the paper.
A2. Yes, your understanding is correct. It is formally stated in Assumption 3.4 that and are Lipschitz continuous functions of for each and . We will mention it at line 48 in the revised version.
A3. Thank you for your suggestion. We will mention (Li et al., 2023) in the related work section.
A4. Yes, we will fix it in the revised version.
A5. Thank you for your insightful question. First, the main reason for the averaging step is that we need to ensure the condition for all . We achieve this in Lemma 4.7 by choosing a small to make change slowly. This condition is crucial for controlling the average hypergradient estimation error in Lemma 4.8, which depends on . Thus we need at least , , and to guarantee .
Second, we indeed need a high-probability analysis for the lower-level variable. When the upper-level problem has unbounded smoothness, the hypergradient bias depends on both the lower-level approximation error and the hypergradient itself; see lines 803-804 of Lemma E.1 for details. These two elements are statistically dependent, necessitating a high-probability analysis for the lower-level problem. Moreover, Appendix D.3 shows that when , we need steps to guarantee with high probability (lines 749 and 755). Please refer to the proof of Lemma 4.6 (lines 748-757) in Appendix D.3 for details.
Third, as mentioned in A1, we can show that our bilevel problem setting satisfies Assumption 4.2. This allows us to apply Lemma 4.3 in Section 4.3.1 to control the lower-level approximation error with high probability.
A6. Please note that in Corollaries 3.2 and 3.4 of (Liu et al., 2023), they set in Algorithm 1. This means their second and third additive terms use the same samples. Our algorithm differs from that in (Cutkosky and Orabona, 2019) mainly in the update for the upper-level variable, where we use normalized SGD and they use SGD (both with recursive momentum). This difference is due to our assumption that the upper-level problem is relaxed smooth, necessitating normalization to reduce the effect of noise. In contrast, Cutkosky and Orabona (2019) assume the single-level objective is $L$-smooth, so SGD (with recursive momentum) is sufficient for their algorithm design.
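For concreteness, here is a minimal single-level sketch of a normalized update with recursive (STORM-style) momentum, in which the same sample appears in both the current-iterate and previous-iterate gradient terms; the code and the names `grad_fn`, `eta`, `beta` are illustrative placeholders, not our actual implementation.

```python
import numpy as np

def normalized_storm_step(x, x_prev, m_prev, grad_fn, sample, eta=0.01, beta=0.9):
    """One step of normalized SGD with recursive (STORM-style) momentum.

    grad_fn(z, sample) returns a stochastic gradient at z; the same sample is
    evaluated at both x and x_prev, which is the recursive-momentum correction.
    """
    g_curr = grad_fn(x, sample)
    g_prev = grad_fn(x_prev, sample)
    # Recursive momentum: exponential average plus a correction term that
    # cancels the drift caused by the change of iterate.
    m = g_curr + (1.0 - beta) * (m_prev - g_prev)
    # Normalization fixes the step length at eta regardless of the gradient
    # magnitude, which is what copes with unbounded smoothness.
    x_new = x - eta * m / (np.linalg.norm(m) + 1e-12)
    return x_new, m
```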
A7. Thank you for catching this issue. In the revised version, we will use to denote the average of the stochastic hypergradient estimators more clearly.
A8. Thank you for noticing this, and we apologize for the typo. It should be instead of . We will fix this typo in the revised version.
Reviewer 2WtS agrees with the authors and will keep rating 5.
Dear Reviewer 2WtS,
We are glad that our responses addressed your major concerns. Thank you for reviewing our paper. You mentioned earlier that "I would like to raise my rating if all the assumptions become problem-related." We have now verified that Assumption 4.2 holds in the bilevel optimization setting (under Assumptions 3.1-3.4) and it is indeed problem-related. Could you please consider raising your score?
Please let us know if you have any further questions or concerns.
Best regards,
Authors
Dear Authors,
Reviewer 2WtS just raised the rating from 5 to 6, and the soundness score from 2 to 3.
Best regards,
Reviewer 2WtS
Dear Reviewer 2WtS,
Thank you for raising the rating.
Best regards,
Authors
The authors consider bilevel optimization problems under an unbounded smoothness assumption on the outer function. They propose AccBO, an AID-based method that uses the Stochastic Nesterov Accelerated Gradient method to approximate the lower-level solution and Neumann-series approximations to estimate the inverse Hessian-vector product involved in the hypergradient expression. Then, a normalized STORM-like update is performed for the outer variable. They establish the $\widetilde{O}(\epsilon^{-3})$ oracle complexity of their algorithm. Finally, numerical experiments are provided on the deep AUC maximization and data hypercleaning problems.
Strengths
- S1: The paper is well-written
- S2: The algorithms proposed by the authors achieve SOTA complexity for stochastic bilevel optimization under the unbounded smoothness assumption.
Weaknesses
Major

- W1: Some errors in the code may invalidate the experimental results. See comment C2 in the Code section for details.
- W2: AccBO comes with many hyperparameters, which can be a concern for practical applications.
- W3: l.219: "is nearly optimal in terms of the dependency of $\epsilon$". The authors should be careful when talking about the optimality of an algorithm. In particular, the algorithm AccBO does not belong to the algorithm class considered in [1], which does not take into account the bias of the hypergradient estimate due to the approximation of the inverse Hessian-vector product and of the solution of the inner optimization problem. To my knowledge, the only works considering lower bounds for bilevel problems are [2, 5].
- W4: The works [2, 3, 4] also use Nesterov acceleration for the inner optimization problem, but in the deterministic setting. They should be mentioned in the paper.
Code
Even though it is a theoretical paper, I have several comments regarding the experiment code:
- C1: In the checklist, the authors indicate "The code and data are attached as a supplement with instructions for reproducibility." However, the code documentation is insufficient to run the experiments. The data are not present in the folders and there is no instruction indicating where to download them. As a consequence, when I launch the command `python main.py --methods accbo` as indicated in the `README.md`, I get an error.
- C2: It seems that the implementations of the different methods do not correspond to the ones described in their respective original papers, which might invalidate the empirical results:
  - For `TTSA`, `SUSTAIN` and `AccBO`, the estimation of the inverse Hessian-vector product is computed by one formula in the code, while in the original papers (and even the present paper for AccBO), the authors use another formula.
  - For `SABA`, the SAGA-like variance reduction is absent in the code while being the core of the algorithm. Moreover, `SABA` uses constant step sizes.
  - For `VRBO`, the oracles in the case `step % self.args.update_interval == 0` should be computed with a larger batch size (or even with full batches) than in the inner for-loop.
- C3: The code of `VRBO` for the datacleaning task is missing. The method to compute the hypergradient in `ma_soba.py` is missing as well for the datacleaning task (but not the AUC experiment, thus I assume it is almost the same).
Minor
- l.41: "transforms" -> "transformers"
- In `README.md` (both): `python main.py --medthods [algorithm]` -> `python main.py --methods [algorithm]`
I am inclined to raise my score if my concerns, particularly regarding the experiments, are addressed.
Questions
N/A
Limitations
N/A
Thank you for taking the time to review our paper. We have addressed each of your concerns below.
W1. Code issue.
A1. Please see A5 to A7 for details.
W2. AccBO comes with many hyperparameters.
A2. Please note that existing works on bilevel optimization with variance reduction (Khanduri et al., 2021; Yang et al., 2021; Guo et al., 2021) all involve many hyperparameters. Compared to previous works, AccBO has more hyperparameters since our bilevel problem setting is more challenging: our upper-level problem has unbounded smoothness, while the others consider $L$-smooth upper-level functions. Hence their algorithm designs cannot be applied to our setting.
To achieve acceleration in our problem setting, we update the lower-level variable using the SNAG method with averaging, resulting in additional hyperparameters and (lines 8 and 21 of Algorithm 2), as well as and (only for Option II). In practice, and need tuning, while and are set to default values as in (Hao et al., 2024).
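For illustration only, the following is a minimal sketch of a stochastic Nesterov accelerated gradient loop with iterate averaging for a strongly convex objective; the constants `lr` and `momentum` and the function names are placeholders rather than the schedules and notation of Algorithm 2.

```python
import numpy as np

def snag_with_averaging(y0, grad_fn, samples, lr=0.1, momentum=0.9):
    """Stochastic Nesterov accelerated gradient (SNAG) with iterate averaging.

    grad_fn(z, sample) returns a stochastic gradient of the (strongly convex)
    lower-level objective at z; the averaged iterate is what would be passed
    back to the upper level.
    """
    y = y0.copy()
    z = y0.copy()                              # extrapolated (look-ahead) sequence
    y_sum = np.zeros_like(y0)
    for sample in samples:
        g = grad_fn(z, sample)                 # stochastic gradient at the extrapolated point
        y_next = z - lr * g                    # gradient step
        z = y_next + momentum * (y_next - y)   # Nesterov extrapolation
        y = y_next
        y_sum += y
    return y_sum / len(samples)                # averaged iterate
```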
W3. Optimality and lower bound issue.
A3. Thank you for your insightful question. First, please note that the existing literature on bilevel optimization with variance reduction (Khanduri et al., 2021; Yang et al., 2021) states that their complexity results are near-optimal in terms of the dependency on $\epsilon$; see Section 1.1 of (Yang et al., 2021) and Section 1 of (Khanduri et al., 2021) for details.
Regarding existing works considering lower bounds for bilevel problems, please note that the lower bounds in (Ji & Liang, 2023) apply only to deterministic and convex/strongly-convex upper-level problems, and the lower bounds in (Dagréou et al., 2023) apply only to nonconvex smooth finite-sum objectives in the stochastic setting. In contrast, we study the general expectation form in the stochastic setting, where the upper-level problem is nonconvex and unbounded smooth. Therefore, their lower bounds cannot be applied to our problem setting.
Now we will establish a lower bound via a simple reduction to single-level stochastic nonconvex smooth optimization, assuming the stochastic gradient oracle has mean-squared smoothness.
It is shown in (Arjevani et al., 2023) that under the mean-squared smoothness assumption, a complexity of $\Omega(\epsilon^{-3})$ is necessary for finding an $\epsilon$-stationary point when using stochastic first-order methods for single-level stochastic nonconvex optimization problems. Note that our problem class is more expressive than the function class considered in (Arjevani et al., 2023), making our problem harder. This is because standard smoothness is a special case of relaxed smoothness and single-level optimization is a special case of bilevel optimization.
For example, if we consider an easy case where the upper-level function does not depend on the lower-level problem (e.g., such that is independent of ) and is mean-squared smooth, then the lower bound in (Arjevani et al., 2023) can be applied in our setting. Therefore, the $\widetilde{O}(\epsilon^{-3})$ complexity achieved in this paper is already optimal up to logarithmic factors in terms of the dependency on $\epsilon$.
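To make the reduction explicit, here is a short sketch in generic notation (the symbols $\Phi$, $L_0$, $L_1$ and the quadratic choice of $g$ are used here only for illustration and are not tied to the paper's notation):

```latex
% Relaxed smoothness with L_1 = 0 recovers standard smoothness:
\[
\|\nabla \Phi(x_1) - \nabla \Phi(x_2)\|
  \le \bigl(L_0 + L_1 \|\nabla \Phi(x_1)\|\bigr)\,\|x_1 - x_2\|
  \;\xrightarrow{\;L_1 = 0\;}\;
\|\nabla \Phi(x_1) - \nabla \Phi(x_2)\| \le L_0\,\|x_1 - x_2\|.
\]
% Single-level optimization as a special case of bilevel optimization: if the
% upper-level function does not depend on y and the lower level is, e.g.,
\[
g(x,y) = \tfrac{1}{2}\|y\|^2
  \;\Rightarrow\; y^*(x) \equiv 0,\quad \nabla^2_{xy} g \equiv 0,
  \qquad \Phi(x) = f\bigl(x, y^*(x)\bigr) = f(x),
\]
% then the hypergradient reduces to \nabla f(x), so every mean-squared-smooth
% single-level instance lies in the bilevel problem class and the
% \Omega(\epsilon^{-3}) lower bound of Arjevani et al. (2023) applies.
```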
W4. The works [2, 3, 4] should be mentioned in the paper.
A4. Thank you for your suggestion. We will mention [2, 3, 4] in the related work section. However, could you please clarify what [2, 3, 4] refer to? It seems that the references you mentioned are missing in your review.
C1. Data and instructions are missing.
A5. Thank you for catching this issue. We forgot to include the instructions on how to download the data. We have fixed this (including the data and instructions) in the revised version of the code. Please reach out to the AC for the code link.
C2. Implementation issues.
A6. Please refer to the global rebuttal PDF file for new experimental results and hyperparameter settings.
- Neumann series. In practice, we have fixed the implementation of the Neumann series to use the randomized version described in the paper, and it does not affect the performance much. In theory, the expectations of the two estimators (defined by the two formulas you wrote) are the same. This means that both estimators can be applied to our algorithm and analysis, and using either one does not affect the convergence rate of AccBO. We prefer the version with less randomness in practice, as it leads to slightly better and more stable performance. (A sketch of the two estimators is given after this list.)
- SABA. We have fixed the implementation of SABA and now use a constant step size for SABA. See results in the rebuttal PDF file.
- VRBO. We choose a larger batch size for the outer loop (i.e., at the checkpoint) than for the inner loop. In the experiments, we tune the batch size for the outer loop and set it to be twice that of the inner loop. We reran the two experiments; see the results in the rebuttal PDF file.
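For reference, below is a minimal NumPy sketch of the two standard Neumann-series estimators of an inverse-Hessian-vector product $H^{-1}v$ discussed above: a deterministic truncated sum and a randomly truncated version whose expectation matches it. The names `H`, `L` (scaling constant), and `Q` (truncation length) are illustrative; in practice the matrix products would be Hessian-vector products computed via automatic differentiation rather than explicit matrices.

```python
import numpy as np

def neumann_hvp_deterministic(H, v, L, Q):
    """Truncated Neumann series: (1/L) * sum_{i=0}^{Q-1} (I - H/L)^i v  ~=  H^{-1} v."""
    out = np.zeros_like(v)
    term = v.copy()
    for _ in range(Q):
        out += term
        term = term - H @ term / L          # multiply by (I - H/L)
    return out / L

def neumann_hvp_randomized(H, v, L, Q, rng=None):
    """Randomly truncated estimator: (Q/L) * (I - H/L)^q v with q ~ Unif{0, ..., Q-1}.

    Taking the expectation over q recovers the deterministic estimator above,
    at the cost of additional variance from the random truncation level.
    """
    rng = np.random.default_rng() if rng is None else rng
    q = int(rng.integers(0, Q))
    out = v.copy()
    for _ in range(q):
        out = out - H @ out / L
    return (Q / L) * out
```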
C3. The VRBO code and the hypergradient computation method in ma_soba.py are missing for the datacleaning task.
A7. The file ma_soba.py used in the data-cleaning task is similar to the one used in the AUC task. We have added the implementation of VRBO and the hypergradient computation method in ma_soba.py for the data hypercleaning task. Please reach out to the AC for our anonymized code link.
Minor Issues
We will fix the typos in the revised version of the paper and code.
Dear Reviewer o6Pr,
Thank you for reviewing our paper. We have carefully addressed your concerns regarding the optimality of our algorithm, the implementation of the inverse Hessian-vector product and SABA, as well as the learning rate and batch size choices for the baselines SABA and VRBO.
We have updated our code and rerun the experiments. The new results, along with the hyperparameter settings, are included in the global rebuttal PDF file. Please reach out to the AC for the anonymized code link if needed, as we were notified that "If you were asked by the reviewers to provide code, please send an anonymized link to the AC in a separate comment."
Please let us know if our responses address your concerns accurately. We appreciate your time and efforts and are open to discussing any further questions you may have.
Best regards,
Authors
I thank the authors for fixing their code and have modified my score accordingly.
Dear Reviewer o6Pr,
We are glad that our responses addressed your concerns. Thank you for reviewing our paper.
Best,
Authors
This paper investigates a class of stochastic bilevel optimization problems where the upper-level function is nonconvex with potentially unbounded smoothness, and the lower-level problem is strongly convex. To improve the convergence rate, the authors propose a new Accelerated Bilevel Optimization algorithm named AccBO. This algorithm updates the upper-level variable using normalized stochastic gradient descent and the lower-level variable using the stochastic Nesterov accelerated gradient descent algorithm with averaging. The proof shows that AccBO has an oracle complexity of $\widetilde{O}(\epsilon^{-3})$, relying on a novel lemma describing the dynamics of the stochastic Nesterov accelerated gradient descent algorithm under distributional drift. Experimental results demonstrate that AccBO significantly outperforms baseline algorithms in various tasks, achieving the predicted theoretical acceleration.
Strengths
1. The paper introduces AccBO, a novel algorithm that leverages normalized recursive momentum for the upper-level variable and Nesterov momentum for the lower-level variable. This dual approach to acceleration in bilevel optimization under stochastic settings is a significant advancement.
2. AccBO achieves an oracle complexity of $\widetilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point. This is a substantial improvement over the previously best-known complexity of $\widetilde{O}(\epsilon^{-4})$ for similar problems. The paper provides new proof techniques, particularly in analyzing the dynamics of stochastic Nesterov accelerated gradient descent under distributional drift.
3. The effectiveness of AccBO is empirically verified across various tasks, including deep AUC maximization and data hypercleaning. The results show that AccBO achieves the predicted theoretical acceleration and significantly outperforms baseline algorithms.
4. The paper extends the bilevel optimization framework to handle upper-level functions with unbounded smoothness, which is a realistic scenario in many neural network applications. This makes the proposed algorithm applicable to a broader range of practical problems.
Weaknesses
The convergence analysis for Option I of the algorithm is restricted to the lower-level problem being a one-dimensional quadratic function. This limitation reduces the generality and applicability of the theoretical results for more complex lower-level problems.
Questions
In reference [33], Assumption 2 states , while in this paper, Assumption 3.2 states . Can this assumption be relaxed to a less stringent form?
Limitations
The convergence analysis for Option I of the AccBO algorithm assumes that the lower-level problem is a one-dimensional quadratic function. This limits the algorithm's applicability to scenarios where the lower-level function deviates significantly from this simple form, especially in higher dimensions or more complex optimization landscapes.
Thank you for taking the time to review our paper. We have addressed each of your concerns below.
Q1. The convergence analysis for Option I of the algorithm is restricted to the lower-level problem being a one-dimensional quadratic function. This limitation reduces the generality and applicability of the theoretical results for more complex lower-level problems.
A1. Thank you for your insightful question. Please note that Option II of Algorithm 2 works for the general case of high-dimensional functions and also enjoys an accelerated oracle complexity of $\widetilde{O}(\epsilon^{-3})$. Therefore, it can indeed handle complex lower-level problems in high dimensions and is more general than Option I.
One can regard Option I as a way to leverage the nice structure of the lower-level problem, making it easier to implement in practice. When the lower-level function is one-dimensional and quadratic (e.g., deep AUC maximization), we can invoke Option I, making AccBO a single-loop algorithm, which is easier to implement. In the general case, when the lower-level objective is a strongly convex function in high dimensions (e.g., data hypercleaning), we can invoke Option II, resulting in AccBO becoming a double-loop algorithm.
Q2. In (Hao et al., 2024), Assumption 2 states , while in this paper, Assumption 3.2 states . Can this assumption be relaxed to a less stringent form?
A2. In this paper, we use the Neumann series approach to handle the Hessian inverse and hypergradient approximation. Therefore, we need to upper bound the bias term when proving the convergence of AccBO (see lines 803-804 of Lemma E.1 in Appendix E for details). The assumption is particularly crucial for deriving Lemma B.5 in Appendix F.1 under our current analysis, which is necessary for proving Lemma E.1.
In contrast, (Hao et al., 2024) uses the linear system approach to handle the Hessian inverse and hypergradient approximation, so bias terms like do not appear in their analysis. As a result, they only need a slightly relaxed assumption, , to show the smoothness properties of the bilevel optimization problem.
In fact, many existing works in the bilevel literature that use the Neumann series approach to handle the Hessian inverse require this assumption (i.e., ). For example, see Assumption 1 of (Ghadimi and Wang, 2018), Assumption 2 of (Ji et al., 2021), Assumption 1 of (Hong et al., 2023), Assumption 1 of (Khanduri et al., 2021), Assumption 1 of (Yang et al., 2021), and Assumption 1 of (Chen et al., 2021).
Q3. The convergence analysis for Option I of the AccBO algorithm assumes that the lower-level problem is a one-dimensional quadratic function. This limits the algorithm's applicability to scenarios where the lower-level function deviates significantly from this simple form, especially in higher dimensions or more complex optimization landscapes.
A3. We have addressed your concern about this limitation in A1. Please see A1 for details.
This paper considers a class of stochastic bilevel optimization problems where the upper-level function is nonconvex with potentially unbounded smoothness, and the lower-level problem is strongly convex. Their novel algorithms achieve an oracle complexity of $\widetilde{O}(\epsilon^{-3})$ to find an $\epsilon$-stationary point.
Strengths
- The setting is novel: they consider problems with potentially unbounded smoothness.
- Their complexity strictly improves the state-of-the-art oracle complexity for unbounded smooth nonconvex upper-level problems and strongly convex lower-level problems.
Weaknesses
- In the algorithm, for the update of the normalized stochastic gradient, it seems that it can be reformulated to tune the learning rate by dividing by the norm of the stochastic gradient. Therefore, the novelty and reasonableness of this step may be unclear.
- In the experiments, it is common to use the same learning rate for baselines and the proposed algorithms for a fair comparison. However, in the experimental details, the authors provide the best learning rate pairs for different baselines, which seems unusual.
Questions
- Regarding the main challenge, the authors state why previous algorithms and analyses do not work. However, this explanation is not clear to me. Can you elaborate on how your work addresses the problem of hypergradient bias? Also, how does your potential function argument differ from previous work?
Limitations
N/A
Thank you for taking the time to review our paper. We have addressed each of your concerns below.
Q1. In the algorithm, for the update of the normalized stochastic gradient, it seems that it can be reformulated to tune the learning rate by dividing by the norm of the stochastic gradient. Therefore, the novelty and reasonableness of this step may be unclear.
A1. Thank you for your insightful question. We would like to emphasize that we use normalized stochastic gradient descent with recursive momentum for updating the upper-level variable to reduce the effects of stochastic gradient noise, as well as the effects of unbounded smoothness and of the gradient norm. A similar approach of using normalized stochastic gradient descent with standard momentum is outlined in Section 3.1 of (Hao et al., 2024).
The novelty and reasonableness of our upper-level update lie in the following two aspects: first, we use a recursive momentum update for (line 23 of Algorithm 2) to achieve variance reduction and acceleration, while Hao et al. (2024) use a moving average estimator for ; second, we choose a larger learning rate to achieve acceleration. Specifically, for any given small (i.e., the target gradient norm ), our algorithm's learning rate is , while that of (Hao et al., 2024) is . The larger learning rate is crucial for proving acceleration.
Q2. In the experiments, it is common to use the same learning rate for baselines and the proposed algorithms for a fair comparison. However, in the experimental details, the authors provide the best learning rate pairs for different baselines, which seems unusual.
A2. We would like to clarify that it is indeed a common practice in the bilevel optimization literature to compare different baselines using the best-tuned learning rate pairs. For example, see Appendix A and B of (Ji et al., 2021), Appendix B of (Yang et al., 2021), Section 5 of (Hong et al., 2023), Section 4 of (Khanduri et al., 2021), and Section 4 of (Chen et al., 2022).
Q3. Regarding the main challenge, the authors state why previous algorithms and analyses do not work. However, this explanation is not clear to me. Can you elaborate on how your work addresses the problem of hypergradient bias? Also, how does your potential function argument differ from previous work?
A3. Thank you for your insightful question, and we apologize for the ambiguity. Most previous works, such as (Khanduri et al., 2021; Yang et al., 2021), assume that the upper-level objective function is $L$-smooth. Their algorithms and analyses rely on the $L$-smoothness property to obtain convenient forms of the hypergradient bias, which typically depend on the approximation error of the lower-level variable, and on either the Neumann series approximation error or the linear-system solution approximation error when handling the Hessian inverse. Most importantly, their hypergradient bias does not depend on the (ground-truth) hypergradient itself.
In contrast, when the upper-level problem has unbounded smoothness (i.e., it is relaxed smooth, as stated in Assumption 3.1), the hypergradient bias depends on both the approximation error of the lower-level variable and the hypergradient itself. Please see lines 803-804 of Lemma E.1 in Appendix E for details. These two elements are statistically dependent, necessitating a high-probability analysis for the lower-level problem, which means the standard in-expectation analysis used in previous works cannot be applied to our setting. Another recent work, (Hao et al., 2024), studies the same setting as ours, but their algorithm does not achieve acceleration.
Let us provide more details to explain how we handle the hypergradient bias (lines 803-804 of Lemma E.1). Specifically:
- In Lemma D.6, we control the first term by Jensen's inequality and induction.
- In Lemma B.4, we control the second term by leveraging the Neumann series approximation result from (Khanduri et al., 2021).
- In Lemma B.5, we control the third term using the triangle inequality and the relaxed smoothness property of the upper-level function.
With a suitable choice of parameters, the hypergradient bias can be small on average.
Regarding the potential function argument, we directly use the objective itself as the potential function. This approach differs from previous works that incorporate both the function value and the approximation error from the lower-level problem. Our problem setting and analysis do not align with those in previous works, necessitating this different approach.
Thank you very much for your reply. Below we provide further explanations regarding your additional concerns.
- First, we would like to clarify that the learning rate should be instead of when considering normalized stochastic gradient descent. For example, see Algorithm 1 of (Cutkosky and Mehta, 2020) (Input: , learning rate , ), Algorithm 1 of (Jin et al., 2021) (Input: , learning rate , ), and Algorithms 1 and 2 of (Liu et al., 2023b) (middle of page 6, "the step size is chosen by balancing every other term to get the right convergence rate") for details. We follow the same practice and refer to the learning rate of the update (i.e., line 24 of Algorithm 2) as , rather than . Additionally, our choice of learning rate is independent of the magnitude of ; please see Theorem 4.1 and Eq. (55) in Appendix E on page 31 for more details.
- Second, please note that we study the same problem setting as (Hao et al., 2024), where the upper-level problem has unbounded smoothness (i.e., it is relaxed smooth, as stated in Assumption 3.1). Both our proposed Algorithm 2 (i.e., AccBO) and Algorithm 1 in (Hao et al., 2024) use normalization and choose a fixed learning rate . We mention allowing a larger learning rate because, for any given small (i.e., the target gradient norm ), AccBO's learning rate is , while that of (Hao et al., 2024) is . In other words, we say that we choose a larger learning rate specifically in comparison with (Hao et al., 2024). The larger learning rate is crucial for achieving acceleration: it improves the complexity from $\widetilde{O}(\epsilon^{-4})$ as in (Hao et al., 2024) to $\widetilde{O}(\epsilon^{-3})$ as in this paper.
Please let us know if you have further questions. Thanks!
Best,
Authors
Thank you for your answer. I will raise my score accordingly.
Best.
Dear Reviewer wLtJ,
We are glad that our answer addressed your concerns. Thank you for reviewing our paper.
Best,
Authors
Dear Reviewer wLtJ,
Thank you for reviewing our paper. We have carefully addressed your concerns regarding the novelty and reasonableness of the learning rate choice, the fairness of comparisons between our proposed algorithm and other baselines, and how our work tackles the challenge of hypergradient bias.
Please let us know if our responses address your concerns accurately. If so, we kindly ask you to consider raising the rating of our work. We appreciate your time and efforts and are open to discussing any further questions you may have.
Best,
Authors
I appreciate the authors' reply. I understand the rationale for a larger learning rate from the convergence rate perspective. However, I still find the gradient normalization on line 24 in Algorithm 2 somewhat problematic in claiming this larger learning rate. From my perspective, when the norm of is larger, it would allow for a higher learning rate. Please correct me if I'm mistaken.
General Response to All Reviewers
Thank you to all the reviewers for taking the time to review our paper and provide valuable feedback. We have addressed each of your concerns individually, and below is a summary of the key changes made during the rebuttal phase.
- We would like to clarify that Option I of AccBO is not a drawback of our approach; it is actually a benefit. Option I leverages the simpler structure of the lower-level problem, making it easier to implement. For one-dimensional quadratic functions (e.g., deep AUC maximization), Option I turns AccBO into a single-loop algorithm, simplifying implementation. Option II makes AccBO a double-loop algorithm, which works for the general case of high-dimensional strongly convex functions (e.g., data hypercleaning) and also enjoys an accelerated oracle complexity of $\widetilde{O}(\epsilon^{-3})$, the same as Option I. This makes it capable of handling complex high-dimensional lower-level problems and more general than Option I.
- We have updated the code and experiments, and our original implementation of the Neumann series is not wrong. We have added new experiments based on the reviewers' requests (e.g., different learning rate schemes and batch sizes), and the results are reported in the global rebuttal PDF file. They show that our algorithm AccBO still significantly outperforms other methods. If you want to check the updated code, please reach out to the AC (we have sent the anonymized code link to the AC).
- We have fixed the typos and minor issues, and we will update them in the revised version of the paper.
There was some disagreement among the reviewers about this submission. Based on re-reading the whole discussion (taking into account the reviewers' levels of expertise) and my own evaluation of the paper, I recommend acceptance.