PaperHub
Rating: 6.0/10 (Poster; 4 reviewers; min 5, max 7, std 0.7)
Individual ratings: 5, 6, 7, 6
Confidence: 3.5
Correctness: 2.8 · Contribution: 2.8 · Presentation: 3.0
NeurIPS 2024

Adaptive Proximal Gradient Method for Convex Optimization

Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

Adaptive versions of GD and ProxGD with large steps.

Abstract

Keywords
adaptive methods, gradient descent, proximal gradient method

Reviews and Discussion

Review (Rating: 5)

The paper refines the analysis of AdGD to accommodate larger stepsizes, an approach that exploits local curvature instead of global smoothness. The technique is later extended to ProxAdGD. Experiments demonstrate the superiority of ProxAdGD over Armijo's linesearch.

Strengths

  1. The paper unfolds by providing intuition and examples, helping readers to deepen their understanding beyond the technical results.
  2. Better guarantees that exploit local properties of the objective are desirable, both in deterministic and stochastic settings.

Weaknesses

  1. A more detailed comparison to the results of [MM20] is required. Does the improvement merely tighten the constants, or does it enable drastically different behaviour than [MM20]?
  2. Experiments comparing the stepsize choices of the paper and of [MM20] are sorely missing: both the difference between the stepsizes along the optimization trajectory and the optimization error of each method should be reported.

Questions

  1. See weaknesses.
  2. Did the authors consider the stochastic case? Either by a theoretical analysis or by experiments. [MM20] provided both a theoretical guarantee (without adaptivity to local curvature) and experiments with neural networks. (The reviewer does not ask the authors to conduct new experiments with neural networks but merely asks if such were previously performed.)

Overall, the paper presents a clear picture of AdGD which helps deepen our understanding, includes an improved stepsize selection, and introduces the new AdProxGD method. That being said, the quantification of the improvement is unclear due to the limited comparison with [MM20]. Such a comparison would go a long way toward establishing the value of this work.

Typos: line 54 - some some, line 272 - experimentswe.

Limitations

Discussed in weaknesses.

Author Response
  1. The improvements over [MM20] are multiple:

    (i) We allow for larger steps, leading to improved final complexity. While a direct comparison is missing due to the less explicit final bound in [MM20], we have strived to make our bounds as explicit as possible in this work.

    (ii) We have gained a better understanding of the proposed algorithm and the reasons behind the complex structure of the stepsize (Theorem 1).

    (iii) We have extended this improved analysis to the proximal case, which is a non-trivial task compared to the setting with a fixed stepsize.

  2. In our work, we haven't conducted experiments in the stochastic case. In [MM20], there wasn't a strong theory for the stochastic case, so a variation of the algorithm proposed in [MM20] was tested with additional hyperparameters. These hyperparameters cover a family of algorithms, including the newly proposed one (though [MM20]'s theory didn't support it). So the method we propose here was already tested in the unconstrained stochastic case by [MM20], and by repeating these experiments we wouldn't gain new insights. Additionally, the constrained case for neural networks is very uncommon and not particularly interesting.

    We hope that our response addresses your concern and we would greatly appreciate it if you revisited the score given to our work. As it currently stands, it is unclear to us why you voted for the rejection of our work.

Comment

I thank the authors for their response.

From what I gather, the step-size update of Algorithm 1 is larger than that of [MM20] only in cases where $\sqrt{1+\theta_{k-1}}\,\alpha_{k-1} > \frac{1}{\sqrt{2} L_k}$ (due to the recursive step-size update the actual comparison is more nuanced), so the step-sizes of [MM20] are not strictly smaller but may be equal. The case of Algorithm 2 is even more complicated as $\sqrt{1+\theta_{k-1}}\,\alpha_{k-1}$ is replaced with $\sqrt{\frac{2}{3}+\theta_{k-1}}\,\alpha_{k-1}$.

Is it possible to show that the new step-sizes are strictly greater ($>$ and not $\geq$) than those of [MM20]? If not, is it possible to demonstrate this by experiments (even plotting the step-sizes on the same figure)?
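To make the comparison concrete, here is a minimal Python sketch built from the quantities quoted above (the local estimate $L_k$, the first bound $\sqrt{1+\theta_{k-1}}\,\alpha_{k-1}$, and the second bound $\frac{\alpha_{k-1}}{\sqrt{2\alpha_{k-1}^2 L_k^2 - 1}}$). This is my own illustrative reconstruction, not the authors' exact Algorithm 1; in particular the handling of $\theta_0$ and of the case where the square root is undefined is a guess.

```python
import numpy as np

def adgd_sketch(grad, x0, alpha0=1e-3, iters=200):
    """Illustrative adaptive GD with a gradient-difference stepsize."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    x = x_prev - alpha0 * g_prev              # first step with a user-chosen alpha0
    alpha_prev, theta_prev = alpha0, 0.0      # theta_0 = 0 is a simplifying assumption
    for _ in range(iters):
        g = grad(x)
        # local curvature estimate from observed gradient differences
        L_k = np.linalg.norm(g - g_prev) / np.linalg.norm(x - x_prev)
        bound1 = np.sqrt(1.0 + theta_prev) * alpha_prev
        # the [MM20]-style cap would be 1 / (2 * L_k); the cap discussed here is larger:
        t = 2.0 * (alpha_prev * L_k) ** 2 - 1.0
        bound2 = alpha_prev / np.sqrt(t) if t > 0 else np.inf
        alpha = min(bound1, bound2)
        theta_prev = alpha / alpha_prev
        x_prev, g_prev, alpha_prev = x, g, alpha
        x = x - alpha * g                     # convergence checks omitted
    return x
```

For example, `adgd_sketch(lambda x: A.T @ (A @ x - b), np.zeros(A.shape[1]))` would run the sketch on a least-squares objective (with `A`, `b` user-supplied).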

Comment

While a direct comparison of the stepsizes is impossible due to different trajectories, our theory does give an improvement for the sum of stepsizes. The latter determines the convergence rate as it appears in the denominator of the upper bound. In this respect, we can rigorously prove that both of our proposed algorithms (Algorithms 1 and 2) have a better upper bound on the functional gap than [MM20].
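Schematically, bounds of this type take the form

$$F(x^K) - F_* \;\lesssim\; \frac{\|x^0 - x^*\|^2}{\sum_{k=1}^{K} \alpha_k}$$

(up to constants; the exact statement and constants are those given in the paper), so a larger sum of stepsizes directly yields a smaller upper bound on the functional gap.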

As for the experiments, we didn't include a direct comparison because our experiments focused on constrained (or composite) problems, whereas [MM20] only dealt with unconstrained problems. The methods perform similarly in practice, which is in line with the theory that only guarantees a constant-factor improvement, with the new method converging slightly faster due to the larger sum of stepsizes.

Comment

... our theory does give an improvement for the sum of stepsizes... In this respect, we can rigorously prove that both of our proposed algorithms (Algorithms 1 and 2) have a better upper bound on the functional gap than [MM20].

Can the authors provide the proof? Is the sum of stepsizes strictly larger than that of [MM20]? By what factor?

I believe that such a quantification of the improvement would go a long way toward strengthening the claim that the stepsizes are larger.

Comment

For Algorithm 1 in this paper, we didn't provide a detailed proof since it's fairly straightforward and follows the same logic as in [MM20]: the first bound is always increasing, and the second is always greater than $\frac{1}{\sqrt{2} L}$, so the result follows by induction. For Algorithm 2, this result is less obvious and is proven in Theorem 2 (see Appendix B3, page 15, for the exact proof).

For both algorithms in this paper, the constant is improved to $\frac{1}{\sqrt{2}}$ instead of $\frac{1}{2}$. While this might seem like a minor improvement, it's important to realize that this result is for pure gradient descent, without any additional assumptions. Given the limited tools available, it's quite surprising that such an improvement over [MM20] was possible at all. We believe that how this was proved is more significant than the magnitude of the improvement.

The proof for Algorithm 2 was already quite complex. However, with a more detailed case-by-case analysis, this constant could potentially be improved further (see Remark 5, page 16).

To demonstrate that the lower bound $\frac{1}{2L}$ for [MM20] is indeed achievable, one can consider a function with (almost) constant curvature.
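Concretely (a sketch of the idea rather than a formal argument, using the gradient-difference estimate of the local curvature, $L_k = \frac{\|\nabla f(x^k)-\nabla f(x^{k-1})\|}{\|x^k - x^{k-1}\|}$): for the quadratic $f(x) = \frac{L}{2}\|x\|^2$ one has $\nabla f(x) = Lx$ and therefore

$$L_k = \frac{\|L x^k - L x^{k-1}\|}{\|x^k - x^{k-1}\|} = L \quad \text{for every } k,$$

so the [MM20] cap $\frac{1}{2L_k}$ equals $\frac{1}{2L}$ at every iteration, while the refined second bound never drops below $\frac{1}{\sqrt{2}L}$.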

Comment

I thank the authors for their response.

  1. Thanks for the clarification. While not complicated given the main idea (and partly existing in the manuscript), I believe it would be beneficial to state that the lower bound of $\frac{1}{2L}$ is tight in the worst case for the previous work (preferably with a proof sketch or a reference to an appendix proof), while the current analysis is improved.

Further suggestions:

  1. For a future revision or preprint, please consider providing a comparison of the exact stepsizes against [MM20] in an experiment with real data. The paper would improve with such an experiment.
  2. Perhaps more emphasis should be placed in the paper on the extension of the technique to ProxGD rather than on the improved analysis with respect to [MM20].

I thank the authors again for the elaboration and update my score accordingly.

Review (Rating: 6)

The paper introduces new algorithms for solving convex optimization problems, where a step size parameter adapts to the underlying objective function. Adaptation is achieved by making good use of the already available gradient information, and does not incur further computational cost. The authors propose multiple variants. One of these relies on a generalized derivation of an algorithm from a recent paper to achieve improved step sizes. Another provides yet larger stepsizes, and a third one allows non-differentiable (but still convex) objective functions, provided they can be split into a smooth and a non-smooth function where the proximity operator of the non-smooth function is feasible to realize. They compare their algorithm against a simple alternative on a specific problem to demonstrate how it can reduce computational cost.

Strengths

The first algorithm alone, which relies on a generalized derivation of a past algorithm, is of interest. The two additional algorithms further expand the usefulness of the paper. In particular, the third algorithm, which allows handling of non-smooth functions, would be useful in practice. It extends the applicability of the algorithm to constrained problems (provided projections onto the constraint set are feasible in practice), or potentially to problems with sparsity constraints, which can be enforced by convex functions like some variant of an $\ell_1$ norm. I think such problems are pretty relevant for ML, and I'd expect the paper to be of interest to a wide community.
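For concreteness, the composite template in question (generic notation, not necessarily the paper's) is

$$\min_{x}\; f(x) + g(x),$$

with $f$ smooth and convex and $g$ convex but possibly non-smooth, e.g. the indicator of a constraint set with an easy projection, or $\lambda\|\cdot\|_1$ to promote sparsity.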

Weaknesses

The attention is restricted to convex problems. To be fair, this is perhaps welcome, as it allows the authors to provide a clean analysis of their algorithms. In practice, many convex algorithms found their way to non-convex problems with success, sometimes without justification.

At times, I wished the authors would complete their arguments fully. I have a few comments below.

The numerical problem didn't feel very fair for the competitor algorithm (I probably would not have used such an algorithm due to how expensive its steps are -- see below).

Questions

  • eqn 13 : I don't see how a simple substitution of the previous inequality into (10) gets you this. If you're using additional steps, please either avoid them or include the details -- it's perfectly understandable to push some content to the appendix, but the manuscript brings the development to this stage perfectly, only to suddenly switch gears. Since this subsection is meant to be simple, I'd suggest including the details.
  • line 191, "note that the second bound..." : please note this is the "second bound in step 5 of Alg. 2".
  • eqn 15 : In this inequality, should $\alpha_k$ and $\alpha_{k-1}$ be interchanged? In that case, the following interpretation is also not correct.
  • eqn 16 : For a wildly changing function, $L_1$ won't capture the behavior globally anyway. Is this meant for a relatively "well-behaved" function?
  • line 221 : I'd suggest writing out the definition of the proximity operator, $\arg\min_t 0.5 \|x - t\|_2^2 + g(t)$, instead of relying on the shorthand. The text is perfectly readable without knowing what a proximity operator is. In fact, you could also include $\alpha_k$ in there, since that's used in the algorithm description. (A standard concrete instance is sketched after this list.)
  • line 222, "Algorithm 3, presented in Appendix C" : Please include the algorithm as part of the main text.
  • Numerical experiment on the constrained problem : I understand the motivation behind comparing the objective wrt the projections. However, this is particularly a problem where a line search would be feasible due to the high computational cost. It would be interesting to compare against a basic forward-backward splitting method with a fairly reasonable estimate of the local $L$. If that's not possible, and the baseline algorithm is the most suitable, it'd be good to include an argument.
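Regarding the line 221 item above, a standard concrete instance (soft-thresholding for $g = \lambda\|\cdot\|_1$; a generic illustration, not an example taken from the paper) of $\mathrm{prox}_{\alpha g}(x) = \arg\min_t 0.5\|x - t\|_2^2 + \alpha g(t)$ and of one proximal gradient step:

```python
import numpy as np

def prox_l1(x, alpha, lam=1.0):
    # prox of alpha * lam * ||.||_1: closed-form soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - alpha * lam, 0.0)

def prox_grad_step(x, grad_f, alpha, lam=1.0):
    # generic forward-backward / proximal gradient step:
    # x_next = prox_{alpha g}(x - alpha * grad_f(x)) with g = lam * ||.||_1
    return prox_l1(x - alpha * grad_f(x), alpha, lam)
```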

Limitations

N/A

Author Response

Thanks for the positive evaluation of our work!

  1. Indeed, we tried to explain our derivations as thoroughly as possible. As we mentioned, equation (13) is just a substitution of the previous equation into (10). We may have forgotten to clarify that $\alpha^{2}_{k} \|\nabla f(x^{k})\|^{2} = \|x^{k+1} - x^{k}\|^{2}$ (which follows directly from the gradient step $x^{k+1} = x^{k} - \alpha_{k}\nabla f(x^{k})$). We will make this clearer in the revision.

  2. True, we will add this in the revision.

  3. We believe eq. (15) is correct as it stands. It is derived directly by squaring the previous inequality and dividing both sides by $\alpha^{2}_{k-1}$.

  4. We are not sure we fully understood the question. No single number can capture the global behavior of a function. In our case, $L_{1}$ is simply the first approximation of a local Lipschitz constant around $x^{1}$. Note that it serves as a lower bound for the global Lipschitz constant $L$, and this may be the only aspect it captures.

  5. We beg to disagree here about the use of the short notation. We believe $\mathrm{prox}$ enhances readability in the same way that the metric projection $P_{C}$ is much more convenient than writing $\arg\min_{x \in C} \| x-a \|^{2}$ each time. However, we agree that it's better to include the definition you mentioned when we define the proximal operator in line 221.

  6. Thanks for this comment! You cannot imagine how much time we spent trying to fit it into the main text, but it always pushed out something more important. If the paper is accepted, we will have an additional page, so it won't be a problem anymore.

  7. We think it is actually important to compare the algorithm with the most widely used and robust approach — linesearch. Choosing a reasonably accurate estimate of the local Lipschitz constant is easier said than done. For instance, it is far from trivial to determine what to choose in problem (24).

    One of the main goals of this paper was to study composite optimization, where prox or projections matter. However, we conducted an extensive study for the unconstrained case already in [MM20]. Also, note that problems (50) and (52) in the Appendix do not consider projections and only compare gradients.

Comment

Thank you for the responses. My comments were mainly suggestions -- I appreciate the additional clarification that you'll be adding. For the suggestions you choose not to follow, I won't press further.

I was going to insist on (15), but I realized both versions are correct. Starting from (assuming the term inside $\sqrt{\cdot}$ is positive)

$$\alpha_k \leq \dfrac{\alpha_{k-1}}{\sqrt{2 \alpha_{k-1}^2 L_k^2 - 1}}$$

we have

$$2 \alpha_{k-1}^2 L_k^2 - 1 \leq \dfrac{\alpha^2_{k-1}}{\alpha^2_k}.$$

Rearranging and dividing by 2, I get

$$\alpha_{k-1}^2 L_k^2 - \dfrac{\alpha^2_{k-1}}{2\alpha^2_k} \leq \frac{1}{2}.$$

I thought this contradicts (15), but multiplying both sides by $\dfrac{\alpha^2_{k}}{\alpha^2_{k-1}}$ and rearranging, we get (15).

Comment

We thank you for the suggestions; they helped us identify places that required clarification. Please let us know if you have unresolved concerns. If, however, our feedback addressed your concerns, we'd also appreciate it if you updated the paper score.

Comment

Thank you, I don't have unresolved concerns. I increased my score to a 6.

Review (Rating: 7)

This paper considers optimization problems where the function is convex and possibly composite. Gradient Descent and the Proximal Gradient method are studied, and one of the main motivations of this work is the paper [MM20], which developed a locally adaptive step size and an associated algorithm, 'adaptive gradient descent without descent' (AdGD). In the current work, the authors consider whether the bounds on the step for AdGD are essential, and they find that they are, but can be improved upon. They present Algorithm 2, which employs their refined step size, and Theorem 2 details the convergence properties. As well as improving the step size criteria, this work also extends that of [MM20] by presenting an adaptive proximal variant. The algorithm is given in Algorithm 3, and convergence properties are presented in Theorem 3. The authors compare their algorithm with others in the literature in Section 5, including detailing possible extensions, and why they might prove tricky. Finally, numerical experiments are presented in Section 6 (with proofs and further experiments in the appendix).

Strengths

Writing. The paper was well written and I enjoyed reading it. I liked the writing style because the authors made efforts to explain why they were doing what they were doing (not just what they were doing). I liked that the authors opted to keep the exposition simple and easy to follow (e.g. see Remark 1 regarding the potential use of cocoercivity).

Contribution. Gradient descent is one of the most important algorithms in this field, and the step size has a major impact on the practical performance of the algorithm. This work presents a step size selection condition that is conceptually simple and cheap to evaluate, which is a nice contribution. The extension to the proximal setting is also welcome.

Weaknesses

It would be good if the authors tightened up the wording of, for example, their theorem statements. For example, for Theorem 2, I think it would be better if they mentioned "given an initial point $x_0$ and an initial step size $\alpha_0>0$". Also, the phrase "converges to a solution" is used, but it would be better if they also referred to the problem they were trying to solve and said what they mean by a solution (and they should also specify that $F=f$, define $f_*$, etc.). They should then check the other theorem statements and update them appropriately. I feel it is really important for them to be precise.

In Lemma 1, I think it should be mentioned that $\alpha_k>0$ (i.e., the step size can be 'arbitrary but positive').

Numerical experiments. The authors presented several numerical experiments, which is good. However, it would have been good to have had more discussion explaining to the reader what the plots were showing, i.e., more emphasis describing the observed practical behaviour of the algorithms, and specifically commenting on the performance for the parameter choices.

Questions

N/A

Limitations

N/A

Author Response

Thank you for the positive evaluation of our work and the high praise! We are especially pleased that you liked the way the paper is written; we put a lot of effort into making it readable and enjoyable.

We agree with tightening the statements and will make the necessary changes in the revision.

The same goes for Lemma 1.

Regarding the descriptions of the experiments, we were limited by space. If the paper is accepted, we will make sure to add more details.

Review (Rating: 6)

In the paper, the authors explore two adaptive first-order algorithms in convex optimization, namely gradient descent and the proximal gradient method, which are based on observed gradient differences. With a novel Lyapunov energy, the authors prove convergence assuming only a local Lipschitz condition on the gradient, and they extend the analysis to the proximal case. In addition, the methods allow larger initial stepsizes than those in previous work.

Strengths

For many other methods, such as gradient descent, one has to assume the problem is globally L-smooth and compute the value of L. In contrast, for the methods in this paper one only needs the objective function to be locally L-smooth.

As with the Barzilai-Borwein stepsize, the methods do not include a linesearch procedure: they are based only on observed gradient differences and do not require a high computational cost at each iteration. However, in contrast to the Barzilai-Borwein stepsize, the paper guarantees convergence theoretically in the convex and proximal cases.
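For concreteness, the two gradient-difference quantities in question (standard formulas; variable names are illustrative, not taken from the paper):

```python
import numpy as np

def bb_step_and_curvature(x, x_prev, g, g_prev):
    s, y = x - x_prev, g - g_prev
    alpha_bb = s.dot(s) / s.dot(y)               # Barzilai-Borwein (long) stepsize, no linesearch
    L_k = np.linalg.norm(y) / np.linalg.norm(s)  # local curvature estimate capping the adaptive step
    return alpha_bb, L_k
```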

Weaknesses

  1. We may not be able to guarantee convergence in the nonconvex case.
  2. In the second experiment, the first term in the problem should probably be $(1+x_1^2)^{1/2}$ rather than $(1+x_1)^{1/2}$.

Questions

  1. What would the result be when using stochastic gradients?

Limitations

N/A

Author Response

Thanks for the positive evaluation of our work!

  1. Indeed, the theory won't be applicable in the nonconvex case. Convexity was instrumental in deriving all the proofs, and in the weakly convex case it is not clear how to proceed. However, we don't agree that this is a weakness. First of all, convex problems arise as subproblems in many applications such as bilevel optimization and optimal control, and black-box methods are of particular interest in such cases. Secondly, many optimization methods such as momentum and Nesterov's acceleration have been designed using the convex framework and have been used across problem classes.

    Moreover, we do not believe it is possible to achieve the same adaptivity as in our method in the general nonconvex case. While we cannot prove this definitively, there are no works, at least to our knowledge, that demonstrate adaptivity (as defined in the paper) in the general nonconvex case with a sound theory and convergence rate. Thus, we believe it is unjust to criticize the paper for not achieving something that is currently beyond reach for all.

  2. Thank you for noticing the typo!

  3. Unfortunately, it is again unclear how to develop a sound theory in the stochastic case; the theory is lacking even for linesearch methods. The difficulty comes from the fact that convergence guarantees for SGD usually require the stepsize to use either the Lipschitz constant of the full gradient or the maximum over all Lipschitz constants across all stochastic samples. Empirical stochastic gradients would only give a rough estimate of these quantities, so it is not immediate how we could get solid convergence guarantees for such methods. However, we believe it is a very intriguing and interesting direction for future research and we hope that adaptive methods will be extended to this setting.

Final Decision

After the rebuttal, all reviewers recommended acceptance.