Local Curvature Descent: Squeezing More Curvature out of Standard and Polyak Gradient Descent
We propose provably convergent adaptive methods under assumptions that reinforce smoothness and convexity with curvature information.
Abstract
Reviews and Discussion
This paper introduces notions of local/relative convexity and smoothness that generalize existing notions in the literature (See, for example RSN: Randomized Subspace Newton by Gower et al). Three algorithms are proposed which exploit this property, two of which have accompanying theoretical guarantees. A practical discussion of functions possessing these properties follows, and then the paper concludes with some illustrative numerical experiments.
Strengths and Weaknesses
Strengths
- This paper is easy to read. I greatly appreciate the layout; the use of short sections and paragraphs makes following along very easy.
- The literature review is current and informative, without overwhelming the reader with unnecessary citations.
- The proposed algorithms are interesting and appear easy to implement (at least when $\mathbf{C}$ is diagonal).
- Overall, an interesting paper, which leaves many further research directions open to the reader.
Weaknesses
- Due to the length of the submission, I did not check the proofs closely, although they seem reasonable. I did find some minor typos (see questions), so I encourage the authors to address these and also double check their proofs.
- See "questions" for some minor points.
Questions
In line 85, with respect to SP2, what is a localization set? Is it as defined in eq 7?
- Consider adding a remark immediately below Assumption 2.1 explaining how your definition of local curvature relates to the relative smoothness and relative convexity of Gower.
- Example 6.1: why would one want to consider the squared Huber loss? Are there any papers in the literature that do so? Similarly, perhaps a bit more motivation for absolutely convex functions could be provided.
- In figures 2 and 3, how do you compute the ground truth solution $x^*$? Is this the solution to the regularized binary classification problem? Moreover, I think plotting a quantity that aligns better with your theorems on the y-axis would be more reasonable.
- I think the right hand side of equation 19 is missing a factor of -2.
Typos:
- line 69 'and a performing a linear solve' should be 'and performing a linear solve'.
- line 274: 'zre' should be 'are'
- line 284: 'thregularizer' should be 'the regularizer'
- line 312: 'performance.The' should be 'performance. The'
- line 872: 'Lemma B.1' should be 'By lemma B.1'
Limitations
Yes.
Justification for Final Rating
I maintain my original score, and think this paper is worth accepting. I thank the authors for answering my questions, and for implementing some of my suggestions.
Formatting Issues
NA
We thank the reviewer for the time and effort they spent reading and engaging with our paper. We are glad that the reviewer found our paper and proposed algorithms to be interesting.
Relative smoothness
Consider adding a remark immediately below Assumption 2.1 explaining how your definition of local curvature relates to the relative smoothness and relative convexity of Gower.
Thank you for this suggestion. In the final paragraph of the subsection, we have a sentence that states the discussion of related assumptions is in Appendix A. In Appendix A, we provide a detailed explanation about the connection to relative convexity and smoothness. We will modify this sentence to explicitly state that we discuss relative convexity and smoothness in the appendix.
Questions
Example 6.1: why would one want to consider the squared Huber loss? Are there any papers in the literature that do so? Similarly, perhaps a bit more motivation for absolutely convex functions could be provided.
We provide the squared Huber loss as an example of a simple function that satisfies our assumption with a non-trivial curvature matrix and provides intuition about this new class of functions. The square of the Huber loss is not strongly convex because its second derivative vanishes at the origin. However, the function has a non-zero curvature matrix elsewhere, which can be employed by our algorithms. So this is a simple example of a function that is not strongly convex (a common assumption in optimization) but satisfies our assumption with a non-trivial curvature matrix. In our assumption, the curvature matrix is allowed to vary with $x$, which makes it much more flexible. So, many functions relevant to optimization are included in this new class. The intuition regarding absolutely convex functions follows similarly.
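For concreteness, a small numerical sketch of this point (our own illustration; we take the Huber transition point to be 1, which need not match the paper's parameterization):

```python
import numpy as np

def huber(x, delta=1.0):
    """Classical Huber loss: quadratic near zero, linear in the tails."""
    return np.where(np.abs(x) <= delta, 0.5 * x**2, delta * (np.abs(x) - 0.5 * delta))

def second_derivative_of_squared_huber(x, delta=1.0, eps=1e-5):
    """Numerical second derivative of huber(x)**2 via central differences."""
    f = lambda t: huber(t, delta) ** 2
    return (f(x + eps) - 2.0 * f(x) + f(x - eps)) / eps**2

print(second_derivative_of_squared_huber(0.0))   # ~0: no strong convexity at the origin
print(second_derivative_of_squared_huber(0.5))   # > 0: usable curvature away from the origin
print(second_derivative_of_squared_huber(2.0))   # > 0: usable curvature in the linear region
```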
In figures 2 and 3, how do you compute the ground truth solution x^*? Is this the solution to the regularized binary classification problem? Moreover, I think plotting a quantity that aligns better with your theorems on the y-axis would be more reasonable.
For these experiments, we first run Newton's method to determine an approximate $x^*$. The update step of LCD2 and the Polyak step size method involves projections onto localization sets containing $x^*$ and is therefore guaranteed to satisfy $\|x^{t+1} - x^*\| \le \|x^t - x^*\|$. We plot $\|x^t - x^*\|$ to highlight this fact.
I think the right hand side of equation 19 is missing a factor of -2.
Thank you for verifying. That is a typo; the factor of -2 comes from Lemma B.1.
We appreciate the reviewer for finding the typos and have corrected them in an updated manuscript.
Concluding Remarks
We thank the reviewer again for their valuable feedback. We hope that our rebuttal addresses their questions and concerns, and we kindly ask the reviewer to consider a fresher evaluation of our paper if the reviewer is satisfied with our responses. We are also more than happy to answer any further questions that arise.
Thanks to the authors for addressing my concerns!
The paper looks at a series of methods for optimizing functions with known point-wise quadratic lower bounds.
Strengths and Weaknesses
This paper proposes a class of functions that generalize convexity by introducing a flexible quadratic lower bound into the usual lower and upper linearization bounds used in the analysis of optimization methods. The paper is well written and easy to read, clearly explaining the concepts introduced and providing detail where necessary.
I have a few concerns with this paper that make me lean towards rejection.
- The actual matrix correction used regularizes the C matrix with L, which for any interesting practical problem means that the actual step taken will be overwhelmingly corrected by L, not C, essentially becoming a diagonal step. This is just far too conservative to make for an interesting step size. This is why the rates obtained are just the same as you would get using a simple 1/L step size for LCD1 - the step will behave virtually the same as that step!
- LCD2 is a trust region method, i.e. the reduction to a 1D optimization over the beta parameter to set the step size is just well known trust region theory. This relation needs to be fleshed out in the paper. The existing theory developed in this area in the last 4 decades is enormous.
- LCD3 is interesting, but without either theory for it, or far more comprehensive experiments, the investigations of that method are not sufficient for publication. A paper focused on this method, with further developments, would be interesting, but requiring knowledge of f* still makes it a niche method.
- The set of problems where C is known and tractable is very small, and the further requirement that we know either L or f* narrows the set even further. It looks like it won't be easier to compute than the Hessian in various practical cases (such as logistic regression). A Hessian-based trust-region baseline would need to be included if that's the problem being focused on.
I would be very (pleasantly) surprised if the proposed methods outperform LBFGS, the standard approach for solving non-stochastic optimization problems without complex constraints. Since matrix calculations are used in formulations LCD2-3, a comparable SOTA quasi-Newton approach would be a good comparison point. The baseline methods used are not generally SOTA and are interesting for different reasons: Polyak for its universality (good rates for non-smooth problems, which LBFGS struggles with), and MM for its lack of any tuning parameters. In the case where , a line search baseline would be appropriate.
Questions
N/A
Limitations
N/A
Justification for Final Rating
See comments
Formatting Issues
Line 72-73 (put citations in parentheses).
We are appreciative of the time and effort the reviewer spent engaging with our work. We are glad that the reviewer found LCD3 to be an interesting method and our work to be well written and easy to read. We now address the reviewer's concern.
$L$-smoothness vs. local smoothness
The actual matrix correction used regularizes the C matrix with L, which for any interesting practical problem means that the actual step taken will be overwhelmingly corrected by L not C, essentially becoming a diagonal step...This is why the rates obtained are just the same as you would get using a simple 1/L step size for LCD1 - the step will behave virtually the same as that step!
Firstly, we emphasize that our rates are based on the constant from Assumption 2.1, which is never larger than the classical constant for functions that are $L$-smooth. Also, we would like to highlight that our goal is to leverage local curvature information to develop methods with adaptive step sizes. To that end, only one of our methods, LCD1, which is an analog of GD with constant step size, employs $L$ in the update step. LCD2 uses an adaptive term $\beta$, and LCD3 is also a fully adaptive method that does not depend on $L$. Furthermore, it is not the case that the update step is dominated by $L$. To illustrate this point, we consider a simple ill-conditioned example built from the Huber loss.
The example is $L$-smooth, yet it satisfies Assumption 2.1 with a non-trivial, piecewise-defined curvature matrix. Therefore, we can see that the update step of LCD1 will not be dominated by the $L$ term everywhere. Furthermore, the gap between the constant in our rates and the classical smoothness constant can be made arbitrarily large in this example. The convergence rates we obtain for LCD1 and LCD2 are based on the former constant and not the classical smoothness constant. Therefore, we obtain better convergence rates up to constants than GD with a constant or Polyak step size, especially when the gap is large. Furthermore, analogous to the Polyak step size, LCD2 can often perform better in practice than suggested by the theoretical convergence rate.
LCD2 vs trust-region methods
LCD2 is a trust region method, i.e. the reduction to a 1D optimization over the beta parameter to set the step size is just well known trust region theory.
We recognize the reviewer's observation regarding potential connections between LCD2 and trust region methods. However, we respectfully disagree with the claim that LCD2 is a trust-region method. Trust region methods minimize a quadratic approximation of $f$ subject to a local radius constraint,

$$\min_{p} \; f(x^t) + \langle \nabla f(x^t),\, p\rangle + \tfrac{1}{2}\, p^\top \mathbf{B}_t\, p \quad \text{subject to} \quad \|p\| \le \Delta_t,$$

where $\Delta_t$ is the trust-region radius and $\mathbf{B}_t$ is a model Hessian. On the other hand, the update step of LCD2 is a projection onto an ellipsoid: instead of minimizing a quadratic model subject to a norm constraint on the step, we are minimizing the distance $\|x - x^t\|^2$ subject to a quadratic constraint derived from the lower bound in Assumption 2.1. While there are some similarities in form, the underlying optimization problems are different.
To solve the constrained optimization problem in the update step of LCD2, we apply standard techniques from optimization theory. We derive the corresponding dual problem in terms of a scalar Lagrange multiplier. This reduces the update step to finding the unique root of a 1D function (which is different from the 1D function you would obtain from the trust-region method). We show that Newton's method can be used to find the root, with guaranteed convergence, as detailed in Appendix C.1, where we also note that similar derivations appear in the trust-region literature [5]. However, Lagrange multipliers and Newton's root-finding method are general tools and are not exclusive to trust-region methods. While the projection onto the ellipsoid in our case also reduces to a scalar root-finding problem, resembling a trust-region step in form, this similarity alone does not make LCD2 a trust-region method.
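To make the scalar root-finding concrete, below is a minimal numerical sketch, not the paper's exact update: it performs a Euclidean projection onto a generic quadratic sublevel set by bracketing the Lagrange multiplier and calling Brent's method (a safeguarded Newton iteration on the same scalar residual would work equally well). The constraint form, the 1/2 factor, and all names are illustrative assumptions rather than the paper's notation.

```python
import numpy as np
from scipy.optimize import brentq

def project_onto_quadratic_sublevel(x0, grad, A, gap):
    """Euclidean projection of x0 onto {x : <grad, x - x0> + 0.5 (x - x0)^T A (x - x0) <= -gap},
    with gap = f(x0) - f_star and A symmetric positive semidefinite."""
    d = len(x0)
    I = np.eye(d)

    def step(lam):
        # Minimizer of 0.5 * ||x - x0||^2 + lam * constraint(x), as a displacement p = x - x0.
        return -lam * np.linalg.solve(I + lam * A, grad)

    def residual(lam):
        # Constraint value at x(lam); the projection corresponds to residual(lam) = 0.
        p = step(lam)
        return grad @ p + 0.5 * p @ A @ p + gap

    if residual(0.0) <= 0:      # x0 is already feasible: the projection is x0 itself.
        return x0.copy()

    hi = 1.0                    # Bracket the root by doubling; residual is nonincreasing in lam.
    for _ in range(200):
        if residual(hi) <= 0:
            break
        hi *= 2.0
    else:
        raise ValueError("constraint set appears to be empty")

    lam_star = brentq(residual, 0.0, hi, xtol=1e-12)
    return x0 + step(lam_star)

# Toy usage on a random instance.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T                     # PSD curvature matrix
x0, grad = rng.standard_normal(5), rng.standard_normal(5)
gap = 0.25 * float(grad @ np.linalg.solve(A, grad))   # keeps the constraint set non-empty
x_new = project_onto_quadratic_sublevel(x0, grad, A, gap)
print(np.round(x_new - x0, 4))
```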
LCD3 + Evaluations
LCD3 is interesting, but without either theory for it, or far more comprehensive experiments, the investigations for that method is not sufficient for publication.
We provide many experiments demonstrating the superior empirical performance of LCD3. We perform logistic regression, linear regression, and Huber loss regression experiments with various regularizers across several datasets. Our experimental evaluations are based on standard setups and commonly used datasets in convex optimization literature.
The set of problems where C is known and tractable is very small, and the further requirement that we know either L or f* narrows the set even further. It looks like it won't be easier to compute than the Hessian in various practical cases (such as logistic regression). A Hessian-based trust-region baseline would need to be included if that's the problem being focused on.
We understand the reviewer's concern but politely disagree. Our assumptions regarding local curvature are weaker than the well-studied strong convexity and smoothness in the literature, and recover convexity and smoothness as a special case. Also, we show interesting examples satisfying our assumptions, which are not strongly convex or smooth.
Additionally, in many cases the curvature matrix is diagonal, e.g. for $p$-norms, so it is easier to compute than the full Hessian for $p$-norm regularized problems. For objective functions expressed as a sum of squared absolutely convex functions $\phi_i$, we have

$$ f(x) = \sum_{i=1}^n \phi_i(x)^2, \qquad \mathbf{C}(x) = \sum_{i=1}^n \nabla\phi_i(x)\nabla\phi_i(x)^\top. $$

Computing the sum of outer products may be cheaper than forming the full Hessian. Notably, even computing this sum over a mini-batch of indices yields a valid curvature matrix satisfying Equation (5). Computing a mini-batch curvature matrix is more efficient, allowing for scalable approximations to the full curvature matrix in large-scale settings. A concrete instance of this setup is linear regression, where $\phi_i(x) = |\langle a_i, x\rangle - b_i|$ (which is absolutely convex), demonstrating a practical scenario.
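As a small illustration of how cheap such a curvature matrix can be relative to the full Hessian, here is a sketch for the least-squares reading of this example (assuming the absolutely convex pieces are the absolute residuals $\phi_i(x) = |a_i^\top x - b_i|$; all function names and the sampling scheme are our own illustrative choices):

```python
import numpy as np

def full_curvature(A):
    """C = sum_i grad phi_i grad phi_i^T for phi_i(x) = |a_i^T x - b_i|.
    The gradients are +/- a_i, so each outer product is a_i a_i^T and the sum is A^T A."""
    return A.T @ A

def minibatch_curvature(A, batch_size, rng):
    """Same sum restricted to a random mini-batch of rows; dropping PSD terms only
    loosens the lower bound, so this is still a valid curvature matrix."""
    idx = rng.choice(A.shape[0], size=batch_size, replace=False)
    return A[idx].T @ A[idx]

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))
C_full = full_curvature(A)                 # O(n d^2) to form
C_mini = minibatch_curvature(A, 32, rng)   # O(|S| d^2), much cheaper when n is large
# Sanity check: C_full - C_mini is PSD (smallest eigenvalue nonnegative up to round-off).
print(np.linalg.eigvalsh(C_full - C_mini).min() >= -1e-8)
```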
Our empirical focus is on comparing our methods to classical counterparts like gradient descent with the Polyak step size, as our approaches extend these adaptive strategies. In our experimental settings, such as regularized logistic regression, where the curvature matrix comes from the regularizer rather than the full Hessian, our methods still outperform adaptive baselines, including in wall-clock time, demonstrating the practical benefit of employing a readily available local curvature matrix.
Knowledge of $f^*$
We acknowledge the reviewer's comment that requiring knowledge of the minimum value $f^*$ poses a potential limitation for LCD2 and LCD3, which are extensions of the Polyak step size. We would like to clarify that in many modern machine learning settings, this requirement may not be as restrictive as it initially appears. There has been a growing body of work on stochastic Polyak step sizes for finite-sum minimization where the requirement is relaxed to knowing each individual $f_i^*$. This relaxation is particularly relevant for overparameterized models, which are now widespread. In such models, the training data can often be fit exactly, implying that each $f_i^* = 0$. As a result, future extensions of our work, particularly in the stochastic setting, can leverage these developments to broaden applicability. In light of this, we respectfully disagree that our methods are niche, as there is a lot of promise for optimization methods in machine learning that use $f^*$.
Another well-known example where the value of the global minimum $f^*$ is known is binary classification with linearly separable data ($f^* = 0$). Additionally, when the value of $f^*$ is not known, we can estimate it as $\hat f^t = f_{\text{best}}^t - \gamma_t$, where the sequence $\gamma_t > 0$ satisfies $\gamma_t \to 0$ and $\sum_t \gamma_t = \infty$, and $f_{\text{best}}^t$ refers to the smallest objective value in the trajectory so far [1]. Furthermore, several works have proposed strategies to employ the Polyak step size in the stochastic setting for finite-sum minimization without knowledge of $f_i^*$. A common approach is introducing slack variables, i.e., finding the smallest possible slack for which the Polyak-type inequality still holds [2] [3] [4]. Stochastic Polyak methods require knowledge of each $f_i^*$, which is a weaker condition than knowing the global $f^*$.
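For completeness, a minimal sketch of the estimated-$f^*$ variant described above (gradient descent with a Polyak-type step using $\hat f^t = f^t_{\text{best}} - \gamma_t$; the choice $\gamma_t = c/\sqrt{t+1}$, the toy objective, and all names are illustrative assumptions, not code from the paper):

```python
import numpy as np

def polyak_with_estimated_fstar(f, grad_f, x0, iters=500, c=1.0):
    """Gradient method with the Polyak step size, replacing the unknown f* by the
    running estimate f_best - gamma_t (as in Boyd's subgradient lecture notes [1])."""
    x = np.asarray(x0, dtype=float)
    f_best = np.inf
    for t in range(iters):
        fx = f(x)
        f_best = min(f_best, fx)
        g = grad_f(x)
        gamma_t = c / np.sqrt(t + 1.0)   # gamma_t -> 0 and sum_t gamma_t = infinity
        f_hat_star = f_best - gamma_t    # optimistic running estimate of f*
        x = x - (fx - f_hat_star) / (g @ g + 1e-12) * g
    return x, f_best

# Toy usage on a simple quadratic (illustrative only).
f = lambda x: 0.5 * float(x @ x)
grad_f = lambda x: x
x_final, f_best = polyak_with_estimated_fstar(f, grad_f, x0=np.ones(5))
print(f_best)
```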
Concluding remarks
We hope that our responses were sufficient in clarifying all the great questions asked by the reviewer. We thank the reviewer again for their time, and we politely encourage the reviewer to consider updating their score if they deem that our responses in this rebuttal merit it.
References
[1] Stephen Boyd. Lecture notes: Subgradient methods, Jan. 2014.
[2] R. M. Gower, A. Defazio, and M. Rabbat. Stochastic Polyak Stepsize with a Moving Target, Sept. 2021.
[3] R. M. Gower, M. Blondel, N. Gazagnadou, and F. Pedregosa. Cutting Some Slack for SGD with Adaptive Polyak Stepsizes, May 2022.
[4] S. Li, W. J. Swartworth, M. Takác, D. Needell, and R. M. Gower. SP2: A Second Order Stochastic Polyak Method, July 2022.
[5] Nocedal and Wright. Numerical Optimization. 2006.
Thanks for answering my points in detail. Have you had a chance to run experiments comparing against LBFGS? I would raise my score if the proposed method is comparable or outperforms it.
We thank the reviewer for engaging with us during the discussion period. We provide the comparison between L-BFGS and our adaptive methods on our experimental setups below. We would like to politely remind the reviewer that new rebuttal rules prevent the sharing of images. Therefore, we provide tables reporting the value attained at the last iterate for each method.
Additional experiments
Logistic regression with regularization on the mushroom dataset. As done in the paper, we compute the regularization parameter as a factor of the $L$-smoothness constant of the logistic regression instance. The first and second tables below correspond to the two parameter settings used.
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
Logistic regression with regularization on the a2a dataset. The first and second tables below correspond to the two parameter settings used.
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
Sum of squares of Huber losses (absolutely convex) on the mpg dataset, first parameter setting.
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
Sum of squares of Huber losses (absolutely convex) on the mpg dataset, second parameter setting.
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
Logistic regression with regularization on the mushroom dataset. The first and second tables below correspond to the two parameter settings used.
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
We find that our adaptive methods are competitive with L-BFGS and can outperform it, demonstrating that local curvature is a useful notion in settings where our assumptions hold.
Justifying Original Baselines
Furthermore, we would like to emphasize that our focus is on developing adaptive algorithms with strong theoretical guarantees. We find that enhancing the classical notions of convexity and smoothness with a local curvature matrix allows us to derive extensions of algorithms such as GD with the Polyak step size with provable convergence. In fact, we achieve better convergence rates up to constants, because the constant governing our rates is never larger than $L$, and we presented a simple example earlier where the gap can be made arbitrarily large. Hence, our original baselines are against methods that are theoretically grounded. We did include the BB step size, but as Reviewer GhWh mentioned, that may not be as valuable as other baseline methods since the BB step size does not have provable convergence in the convex smooth regime.
Until recently, quasi-Newton methods were merely efficient heuristics with weak theory beyond quadratics [1, 2]. While some quasi-Newton methods have super-linear convergence (although not the diagonal or limited-memory variants used in practice), those guarantees only hold locally, when close to the minimum. To work when far from a solution, those methods require "globalization" modifications, such as regularization or a line search. Unfortunately, analyses of second-order methods do not capture the global benefit of preconditioning and instead lead to worse rates than gradient descent [3, 4, 5].
[1] Kovalev et al., "Fast Linear Convergence of Randomized BFGS" 2020.
[2] Rodomanov and Nesterov, "Greedy Quasi-Newton Methods with Explicit Superlinear Convergence". 2021.
[3] Byrd et al., "A Stochastic Quasi-Newton Method for Large-Scale Optimization". 2016
[4] Bollapragada et al., "A Progressive Batching L-BFGS Method for Machine Learning". 2018.
[5] Meng et al., "Fast and Furious Convergence: Stochastic Second Order Methods under Interpolation". 2020.
Dear Reviewer bXSH,
Thank you for the review. The (extended) deadline for the Author–Reviewer discussion is approaching. Please take a moment to review the authors’ latest response and let us know whether your concerns have been satisfactorily addressed.
Best, AC
The authors have addressed my comments, I have considered their rebuttal and revised my score.
The paper considers iterative procedures for minimizing convex functions that satisfy growth conditions that are more general than standard (strong) convexity and smoothness. The contributions of this paper are in proposing these new growth conditions and methods that leverage this in the manner of gradient descent for smooth functions.
Strengths and Weaknesses
Notwithstanding the comments on writing below, I think the paper is very clear and easy to understand. I think stating the results for the special cases is useful, and provides a more comprehensive picture. I think the notions of growth proposed are interesting and different to the more canonical ideas in convex optimization. The methods proposed are rather interesting too and original, particularly LCD2 and LCD3.
Writing
I think the paper can be organized a little better; there are a variety of items scattered across many sections in the paper. Here are some remarks that the authors can perhaps use to improve their manuscript.
- The summary of contributions can be made more concise given the page limit, and I feel that the section could be renamed to be more representative of the key assumption, which is new.
- The calculus of these properties, and examples of functions (including absolutely convex functions) satisfying the new growth conditions in (5) and (6), can be discussed as subsections of the new Section 2 proposed above.
- The property about the square of an absolutely convex function being convex is interesting and should be included with Lemma 7.1.
- The LCD methods can be discussed right after, with theorems proving convergence guarantees in the following section, followed by an empirical section.
In my mind, this has a cleaner presentation of the contributions of this work.
Questions
- How does (5) compare with self-concordance?
- Backtracking line search can also be viewed as taking advantage of the local curvature (depending on the criterion used for backtracking), so how does this method compare to backtracking line search for convex, smooth functions?
- While the conciseness of the update for LCD2 is nice, it would be useful to see how $\beta$ is defined without having to refer to the appendix, as it is an essential part of the update. It is also not immediately obvious why LCD2 becomes GD with Polyak stepsize when $\mathbf{C} = 0$ -- perhaps this has to do with the form of $\beta$, which emphasizes the need to include it in the main text.
- How is LCD3 defined when $\mathbf{C}$ is non-invertible? Is it the case that there might be more than one solution to the problem, and in that case would any solution work?
- I can see how LCD1 would be a descent method (i.e., ensures continuous decrease), but can the same be said for LCD2?
- Comparing against Polyak's method for problems with regularization in Figures 3 and 4 is not a fair comparison in my opinion, considering this is not a setting where the gradients are bounded (see https://arxiv.org/pdf/1905.00313 for a discussion). I think it helps point out how the local curvature conditions might be useful, and it would have been more interesting to see a comparison against (damped) Newton's method or a quasi-Newton method.
- How can you potentially circumvent the $O(d^3)$ cost in the methods, especially relative to Newton's method, which also suffers from the same complexity? This leads me to another question (stated below).
- The elegance of Newton's method / quasi-Newton methods is that with an oracle for Hessians and gradients, the method is pretty much black-box. In this case, there is a need to identify $\mathbf{C}$ for a given problem specification, and the authors discuss special cases. Can identifying $\mathbf{C}$ be more "automatic" in some way?
- Why is the BB method behaving so irregularly? Is it an implementation difficulty or a procedural issue with BB?
Limitations
Yes
Justification for Final Rating
I have had positive discussions with the authors and would recommend acceptance.
Formatting Issues
NA
We thank the reviewer for engaging with our paper during the review period! We are glad that the reviewer found our notions of growth to be different to the more canonical ideas in convex optimization and our methods interesting and original. We now address the reviewer's concerns.
Improving Organization
We appreciate the reviewer's detailed feedback on the organization of our paper. In particular, we will modify the section regarding the summary of contributions and update Lemma 7.1 to include that the square of an absolutely convex function is convex, which the reviewer found interesting.
Questions
How does (5) compare with self-concordance?
Self-concordance requires the function to be thrice differentiable and bounds the third derivative (which measures the change in curvature) in terms of the second derivative, i.e., $|f'''(x)| \le 2 f''(x)^{3/2}$ in the univariate case. Specifically, self-concordance allows for better convergence analysis of Newton's method and related quasi-Newton methods. Although the self-concordance property bounds the change in curvature, it is a different concept from (5). Assumption 2.1 does not involve the third derivative, but from Lemmas E.6 and E.10 in the Appendix, we do know that the local curvature matrix yields two-sided bounds on the Hessian. However, it does not say anything about third-order information.
While the conciseness of the update for LCD2 is nice, it would be useful to see how $\beta$ is defined without having to refer to the appendix, as it is an essential part of the update. It is also not immediately obvious why LCD2 becomes GD with Polyak stepsize when C = 0. Perhaps this has to do with the form of $\beta$, which emphasizes the need to include it in the main text.
We acknowledge the reviewer's concern regarding the clarity of the update step of LCD2. When minimizing a convex function $f$, the Polyak step size corresponds to projecting the current iterate $x^t$ onto the halfspace

$$\{x : f(x^t) + \langle \nabla f(x^t),\, x - x^t\rangle \le f^*\}.$$
This projection problem can be solved in closed form, and the update step corresponds to GD with Polyak step size.
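For reference, the standard closed form of this projection (assuming the halfspace written above and $f(x^t) > f^*$; this is the textbook identity, not a new result):

$$
x^{t+1} \;=\; \operatorname*{arg\,min}_{x}\Big\{ \|x - x^t\|^2 \;:\; f(x^t) + \langle \nabla f(x^t),\, x - x^t\rangle \le f^* \Big\}
\;=\; x^t - \frac{f(x^t) - f^*}{\|\nabla f(x^t)\|^2}\,\nabla f(x^t),
$$

which is exactly GD with the Polyak step size.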
The update step of LCD2 corresponds to projecting the current iterate $x^t$ onto the ellipsoid obtained by adding the local curvature term from Assumption 2.1 to the constraint above.
When $\mathbf{C}(x) \equiv 0$, our assumption reduces to convexity and $L$-smoothness, and the ellipsoid becomes the halfspace above. Then it is easy to see that the projection problems are the same, so LCD2 reduces to GD with the Polyak step size. However, LCD2 does not have a closed-form update step, as the reciprocal of $\beta$ is the unique root of a scalar equation that does not have a closed-form solution. We will update Section 3.2 of the paper to include the more detailed explanation above.
How is LCD3 defined when C is non-invertible? Is it the case that there might be more than one solution to the problem, and in that case would any solution work?
In that special case, we can use the pseudo-inverse of $\mathbf{C}$ instead, which means that there is not a unique solution.
I can see how LCD1 would be a descent method (i.e., ensures continuous decrease), but can the same be said for LCD2?
It is true that LCD2 and LCD3 are not guaranteed to decrease the function value at each step. LCD2 and LCD3 project the current iterate onto the localization set, which does not guarantee descent. For a specific example, it is well known that GD with the Polyak step size is not a descent method [1], and LCD2 reduces to the Polyak step size when $\mathbf{C} = 0$.
Comparing against Polyak's method for problems with regularization in Figures 3 and 4 is not a fair comparison in my opinion, considering this is not a setting where the gradients are bounded.
We acknowledge the reviewer's concern about the unbounded gradients in some of our experiments. However, the Polyak step size is special and interesting because it is very robust in many settings, including problems that are not L-smooth or do not have bounded gradients. Many works have proposed stochastic extensions of the Polyak step size to train neural networks. Therefore, we do believe it is interesting and appropriate to compare how LCD2 and LCD3 (which are extensions of Polyak step size with local curvature) perform in the regularization setting.
How can you potentially circumvent the cost for O(d^3) in the methods, especially relative to Newton's method which also suffers from the same complexity. This leads me to another question (stated below)
We acknowledge the concern regarding the computational cost associated with matrix inversion. However, in many of our examples, the curvature matrices exhibit special structure, such as being diagonal, or even rank-one, which allows the required inverses (or linear solves) to be computed efficiently.
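To illustrate the point about structure, here is a short sketch (our own illustrative example, not code from the paper) of applying the inverse of a diagonal-plus-rank-one matrix to a vector in $O(d)$ time via a diagonal solve and the Sherman-Morrison identity:

```python
import numpy as np

def solve_diag_plus_rank_one(d_diag, u, g):
    """Solve (D + u u^T) y = g with D = diag(d_diag) positive, using Sherman-Morrison:
    (D + u u^T)^{-1} = D^{-1} - (D^{-1} u u^T D^{-1}) / (1 + u^T D^{-1} u).
    Total cost is O(d), versus O(d^3) for a dense factorization."""
    Dinv_g = g / d_diag
    Dinv_u = u / d_diag
    return Dinv_g - Dinv_u * (u @ Dinv_g) / (1.0 + u @ Dinv_u)

rng = np.random.default_rng(0)
d_diag = rng.uniform(0.5, 2.0, size=5)
u, g = rng.standard_normal(5), rng.standard_normal(5)
y = solve_diag_plus_rank_one(d_diag, u, g)
print(np.allclose((np.diag(d_diag) + np.outer(u, u)) @ y, g))   # True
```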
The elegance of Newton's method / quasi-Newton methods is that with an oracle for Hessians and gradients, the method is pretty much black-box. In this case, there is a need to identify C for a given problem specification, and the authors discuss special cases. Can identifying C be more "automatic" in some way?
We appreciate the reviewer's question regarding the potential for automatically determining the local curvature matrix $\mathbf{C}$. While this is an interesting and valuable direction, it falls outside the scope of the present work. We comment that for curvature matrices of the form $\mathbf{C}(x) = \mu(x)\mathbf{I}$, it may be feasible to perform power iteration on the Hessian at the current iterate to obtain an estimate of the local strong convexity constant $\mu(x)$.
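A minimal sketch of this idea (our own illustration under the assumed form $\mathbf{C}(x) = \mu(x)\mathbf{I}$): estimate the extreme Hessian eigenvalues at the current iterate with power iteration, using only matrix-vector products, and take the smallest one as the local constant.

```python
import numpy as np

def power_iteration(matvec, d, iters=200, seed=0):
    """Estimate the largest eigenvalue of a symmetric PSD operator from matrix-vector
    products only (e.g., Hessian-vector products obtained via autodiff)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        w = matvec(v)
        lam = v @ w                   # Rayleigh quotient of the current iterate
        v = w / (np.linalg.norm(w) + 1e-16)
    return lam

def local_strong_convexity_estimate(hess_matvec, d):
    """Smallest Hessian eigenvalue via power iteration on the shifted operator
    lam_max * I - H (valid for symmetric PSD H)."""
    lam_max = power_iteration(hess_matvec, d)
    shifted = lambda v: lam_max * v - hess_matvec(v)
    return lam_max - power_iteration(shifted, d, seed=1)

# Toy usage with an explicit Hessian; in practice hess_matvec would be an HVP oracle.
H = np.diag([0.1, 1.0, 3.0])
print(local_strong_convexity_estimate(lambda v: H @ v, d=3))   # approximately 0.1
```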
We focus on the setting where such a curvature matrix is assumed to be available. The main contribution of our work is introducing and demonstrating the value of this assumption. First, we show that in many convex optimization problems, such as $p$-norms, certain convex loss functions, and squares of absolutely convex functions, the curvature matrix is readily available. Next, under these local curvature assumptions, we develop theoretically grounded adaptive algorithms such as LCD2 that improve upon their classical counterparts such as GD with the Polyak step size.
Exploring heuristic or data-driven methods for approximating C is indeed an intriguing direction, and we view this as promising future work. However, our current contribution and focus is developing principled algorithms built on rigorous theoretical foundations.
Why is the BB method behaving so irregularly? Is it an implementation difficulty or a procedural issue with BB?
The BB step size does not have convergence guarantees for convex smooth problems. Therefore, for certain problems, the BB step size method may behave irregularly.
Concluding Remarks
We thank the reviewer again for their valuable feedback. We hope that our rebuttal addresses their questions and concerns, and we kindly ask the reviewer to consider a fresher evaluation of our paper if the reviewer is satisfied with our responses. We are also more than happy to answer any further questions that arise.
References
[1] Stephen Boyd. Lecture notes: Subgradient methods, Jan. 2014.
Thank you for the rebuttal, and I've gone over it carefully.
- [In reference to self-concordance]: yes, self-concordance does involve the third derivative, but the implication of self-concordance in giving a (local) quadratic lower bound (c.f. Eq. 5.1.14, Nesterov [2018]) is mostly what I was intending to refer to (apologies for not being specific enough).
- [In reference to singularity of $\mathbf{C}$]: I would recommend commenting on this in the paper.
- [In reference to comparing with Polyak's step size]: While I do think that the experiments are valuable, a setting where the gradients are bounded would be valuable to have as well, as this gives a more complete picture.
- [In reference to computational cost]: Would be useful to remark on settings where this is true, or instead include a discussion about the computational cost in the examples discussed in your paper.
- [In reference to BB]: If the BB method doesn't have guarantees, then this is not a valuable baseline in my opinion (since all the methods being compared do have guarantees).
We thank the reviewer for engaging with us during the discussion period and we appreciate the detailed constructive feedback.
yes, self-concordance does involve the third derivative, but the implication of self-concordance in giving a quadratic lower bound.
We thank the reviewer for the reference. While the lower bound derived from self-concordance does use the norm induced by the Hessian at $x$, it does not square the norm term and thus does not yield a truly quadratic lower bound on $f$. For illustration, in even a simple one-dimensional case, the self-concordant lower bound involves a term that grows only linearly with the distance from $x$ when this distance is large. In contrast, our assumption using a curvature matrix leads to a lower bound involving $\|y - x\|^2_{\mathbf{C}(x)}$, which grows quadratically with distance and is tighter, particularly when $y$ is far from $x$.
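To make the contrast explicit, here is a schematic side-by-side. The first inequality is the standard self-concordance bound (Nesterov [2018]); in the second, the constant $c$ is schematic, with the exact form given by eq. (5) in the paper.

$$
f(y) \;\ge\; f(x) + \langle \nabla f(x),\, y - x\rangle + \omega\big(\|y - x\|_x\big), \qquad \omega(t) = t - \ln(1+t),
$$

$$
f(y) \;\ge\; f(x) + \langle \nabla f(x),\, y - x\rangle + c\,\|y - x\|^2_{\mathbf{C}(x)}.
$$

Since $\omega(t) = t^2/2 + o(t^2)$ near $0$ while $\omega(t)/t \to 1$ as $t \to \infty$, the first bound is quadratic only locally, whereas the second grows quadratically in $\|y - x\|_{\mathbf{C}(x)}$ at any distance.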
Singularity of C: I would recommend commenting on this in the paper.
We will include a discussion on the potential singularity of $\mathbf{C}$ in both the LCD3 subsection and the empirical section to clarify how it is handled in our experiments.
[In reference to comparing with Polyak's step size]: While I do think that the experiments are valuable, a setting where the gradients are bounded would be valuable to have as well as this gives a more complete picture.
We appreciate the reviewer's feedback on our empirical evaluations. As noted in [1], the Polyak step size achieves the optimal convergence rate among gradient descent methods for convex and $L$-smooth objectives. Several of our experiments, such as logistic regression with regularization, ridge regression, and the sum of squared Huber losses, are instances where the objective is indeed $L$-smooth. Therefore, we believe our comparisons against gradient descent with the Polyak step size are both fair and meaningful in the regime where its theoretical guarantees apply. Additionally, we have updated our baselines to include L-BFGS to further strengthen the empirical comparison.
[In reference to computational cost] Would be useful to remark on settings where this is true, or instead include a discussion about the computational cost in the examples discussed in your paper.
We thank the reviewer for the suggestion. We do include remarks on the computational cost of the update steps at the end of the experimental section. To improve clarity, we will revise the manuscript to move these comments to the beginning of the empirical section. For example, in the case of $p$-norm regularization, the curvature matrix is diagonal, allowing its inverse to be computed efficiently in $O(d)$ time.
If the BB method doesn't have guarantees, then this is not a valuable baseline in my opinion (since all the methods being compared do have guarantees).
We appreciate the reviewer’s feedback on the choice of baselines and agree that, in our context, methods with strong theoretical guarantees are particularly valuable. Nevertheless, we chose to include the Barzilai-Borwein (BB) step size as it is a well-known practical heuristic. Performance comparisons against practical heuristic step sizes may still offer valuable insights for some readers. For example, Reviewer bXSH specifically requested a comparison against L-BFGS (with Wolfe line search), which we have now included in our baselines. We encourage the reviewer to refer to our results in discussion with Reviewer bXSH, where our adaptive methods outperform or remain competitive with L-BFGS.
We also note that while quasi-Newton methods can enjoy super-linear convergence, these guarantees typically hold only locally, i.e., when iterates are sufficiently close to the optimum. Such methods therefore require globalization techniques (e.g., line search or regularization) to ensure convergence when far from the solution. However, existing theoretical analyses of quasi-Newton methods often fail to reflect the empirical benefit of preconditioning and lead to worse global convergence rates than gradient descent [2, 3, 4].
In a future version, we may consider replacing the BB baseline with a quasi-Newton method as a more theoretically grounded point of comparison.
[1] Hazan and Kakade, "Revisiting the Polyak Step Size". 2022.
[2] Byrd et al., "A Stochastic Quasi-Newton Method for Large-Scale Optimization". 2016
[3] Bollapragada et al., "A Progressive Batching L-BFGS Method for Machine Learning". 2018.
[4] Meng et al., "Fast and Furious Convergence: Stochastic Second Order Methods under Interpolation". 2020.
Dear Authors,
Perhaps I am misunderstanding something. Self-concordance does imply a quadratic lower bound; indeed, a self-concordant function is locally strongly convex as long as its Hessian is nondegenerate. See, for example, Chapter 5.1.4 in Lectures on Convex Optimization by Nesterov. Note that the function therein should be considered a quadratic function around 0.
Best, AC
We thank the AC for their insightful comments. We have elaborated on our assumption and the implications of self-concordance below for further clarification.
Indeed, a self-concordant function does admit a quadratic lower bound, but this guarantee is local. Specifically, a standard self-concordant function $f$ satisfies $f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \omega(\|y - x\|_x)$, where $\omega(t) = t - \ln(1+t)$ and $\|\cdot\|_x$ is the norm induced by the Hessian at $x$. When $\|y - x\|_x$ is not too large, $\omega(t) \approx t^2/2$, so we obtain a quadratic lower bound determined by the curvature at $x$, i.e., the Hessian at $x$. So the lower bound implies local strong convexity if the Hessian at $x$ is non-degenerate. But for larger $\|y - x\|_x$, $\omega(t)$ grows only linearly in $t$, so the bound potentially degrades to a linear lower bound as $y$ moves farther away from $x$. Hence, self-concordance ensures only a local quadratic lower bound with the norm induced by the local curvature at $x$.
In our earlier response to the reviewer, we intended to convey that self-concordance does not typically imply a global quadratic lower bound. We apologize for not stating this more precisely.
We would like to emphasize a key distinction highlighted in the introduction of our paper: equation (5) in Assumption 2.1 extends convexity, which is a global property, and thus eq. (5) is also a global property, captured by the fact that we require it to hold for all pairs of points. However, the curvature matrix depends on $x$, which refers to the current iterate of our algorithms. So the curvature matrix depends on the "locus" of the algorithms, which is why we use the terminology "local". This is not to be misunderstood or misinterpreted as "local" reach, such as in the sense of local strong convexity. Moreover, another important difference is that the norm in our lower bound is not necessarily induced by the Hessian at $x$. Our framework allows the use of any positive semidefinite curvature matrix that satisfies the lower bound, enabling greater flexibility. This generality allows a broader class of functions to satisfy our assumption, including ones that are not self-concordant.
To illustrate with a simple example, the squared Huber loss from Example 6.1 is not self-concordant. On the quadratic region of the Huber loss, the squared loss behaves like $x^4/4$, so its second and third derivatives are $3x^2$ and $6x$, which violate the self-concordance inequality $|f'''(x)| \le 2 f''(x)^{3/2}$ near the origin. However, it satisfies our assumption as shown in Example 6.1.
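A quick symbolic check of this claim (our own sketch; it assumes the standard Huber parameterization, works on the quadratic region, and uses the self-concordance constant 2):

```python
import sympy as sp

x = sp.symbols('x', positive=True)
g = (x**2 / 2) ** 2                       # squared Huber loss on its quadratic region |x| <= delta
g2, g3 = sp.diff(g, x, 2), sp.diff(g, x, 3)
# Standard self-concordance requires |g'''(x)| <= 2 * g''(x)**(3/2).
ratio = sp.simplify(sp.Abs(g3) / (2 * g2 ** sp.Rational(3, 2)))
print(ratio)   # proportional to 1/x**2, so the required inequality fails as x -> 0
```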
I think it helps point out how the local curvature conditions might be useful and it would have been more interesting to see a comparison against (damped) Newton's method or a quasi-Newton method.
We thank the reviewer for suggesting a comparison against a quasi-Newton baseline. As requested by another reviewer, we include results with L-BFGS in our experimental setup below. We would like to politely remind the reviewer that new rebuttal rules prevent the sharing of images. Therefore, we provide tables reporting the value attained at the last iterate for each method.
Additional experiments
Logistic regression with regularization on the mushroom dataset. As done in the paper, we compute the regularization parameter as a factor of the $L$-smoothness constant of the logistic regression instance. The first and second tables below correspond to the two parameter settings used.
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
Logistic regression with regularization on the a2a dataset. The first and second tables below correspond to the two parameter settings used.
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
Sum of squares of Huber losses (absolutely convex) on the mpg dataset, first parameter setting.
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
Sum of squares of Huber losses (absolutely convex) on the mpg dataset, second parameter setting.
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
Logistic regression with regularization on the mushroom dataset. The first and second tables below correspond to the two parameter settings used.
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
| LCD2 | LCD3 | L-BFGS |
|---|---|---|
We find that our adaptive methods are competitive with L-BFGS and can outperform it, demonstrating that local curvature is a useful notion in settings where our assumptions hold.
Concluding Remarks
We thank the reviewer again for their feedback, which allowed us to strengthen our work with additional clarifications and experiments that we will include in the updated paper. We also invite the reviewer to ask any other lingering questions they may have or to potentially consider a fresher evaluation of our paper with these rebuttal responses in context.
This article introduces a new class of functions that simultaneously generalize/restrict convexity and L-smoothness and derive better optimization (minimization) algorithms for these functions. The key idea is to introduce the existence of a local quadratic curvature term and a fixed constant L that allows one to both upper bound and lower bound a function f. The lower bound is a linear approximation to f at x plus the local curvature term, while the upper bound is the same linear approximation plus a higher curvature term equal to the original plus L times an identity matrix. When the curvature map is 0, then the lower bound generalizes/restricts convexity, while the upper bound generalizes L smoothness.
The article then continues to exploit the properties of this class of functions to yield new optimization algorithms that can exploit the existence of the local curvature map. They derive 3 algorithms:
Local curvature descent 1 (LCD1): minimize the upper bound. This generalizes gradient descent with constant step size 1/L.
Local curvature descent 2 (LCD2): Euclidean projection onto an ellipsoid in which the lower bound is less than the (known) minimal value of f. This generalizes gradient descent with the Polyak step size, which also requires knowledge of the minimum value.
Local curvature descent 3 (LCD3): this is LCD2 with Euclidean projection replaced by a projection determined by the local curvature term, which has a closed form solution.
The paper nicely derives convergence rates and proves them for LCD1 and LCD2.
They then show that their new class of functions is not trivially empty or uninteresting by constructing general classes of functions for which a local curvature term satisfying the two requisite inequalities exists. These include for example, Huber loss, L_p regularized logistic regression, squared p-norm, and sums of squares of absolutely convex functions.
Finally they perform numerical experiments on L2 regularized logistic regression and show that some of their LCD algorithms exhibit faster convergence than GD with Polyak step size.
Strengths and Weaknesses
Strengths:
- The new class of functions introduced seems quite nice.
- The convergence proofs for two of their algorithms are solid results.
- Showing the class of functions is not trivially empty/uninteresting is nice.
- The paper is reasonably clearly written.
- The numerical experiments are relatively extensive at least on the limited classes of functions that satisfy their assumption.
Weaknesses:
- The field of optimization theory really needs to focus on more natural classes of functions that appear in practice. It is less clear how natural the class of functions introduced is. For example, most functions we are interested in are not convex.
- The method does not extend to stochastic methods, but that can safely be considered out of scope as an entirely new paper would have to be written to cover that.
- Their algorithms that perform well, namely LCD2 and LCD3 require knowledge of the known minimum value.
Questions
You have shown a few examples of classes of functions that satisfy your assumption 2. I would love to understand better how general this class is. Are there even more qualitatively different optimization problems that people care about in practice that satisfy your assumptions? I suspect this is a difficult question though. But anything that can be said here would improve the significance and impact of the paper.
Limitations
Yes.
Formatting Issues
None.
We thank the reviewer for spending the time and effort engaging with our work. We appreciate that the reviewer found our convergence proofs of LCD1 and LCD2 to be solid results and our numerical experiments to be relatively extensive on functions that satisfy our assumption.
Justifying Local Curvature
The field of optimization theory really needs to focus on more natural classes of functions that appear in practice. It is less clear how natural the class of functions introduced is. For example, most functions we are interested in are not convex.
We acknowledge the reviewer's concern that the relevance of convex optimization theory for modern machine learning is not initially apparent. However, we would like to politely push back on this assertion, as developing theory in the convex setting is important for understanding methods and can lead to insights in more complex regimes, such as the non-convex case. Various adaptive methods that are the workhorse for modern machine learning, such as Adagrad (and its variants such as Adam) were developed with theory in the convex setting. Other powerful adaptive step sizes such as the MM step-size [6] and the Polyak step size were developed with theory for the convex setting.
Furthermore, we believe that Assumption 2.1 offers a natural way to incorporate a notion of curvature into the classical framework of convexity and smoothness without going down the route of fully-fledged second-order methods. We demonstrate its applicability through several relevant examples in convex optimization, including $p$-norms, convex loss functions, and squares of absolutely convex functions, where this notion of local curvature can be effectively exploited to design adaptive algorithms with superior performance. Interestingly, some curvature matrices arising in these examples exhibit special structure, such as being diagonal, which is particularly relevant in machine learning settings. For instance, recent work [5] has shown that the Hessians of deep neural networks often have a block-diagonal structure. A better understanding of these diagonal patterns can facilitate the design of more efficient and scalable adaptive optimization methods.
Extensions to Stochastic Setting
We agree that extensions to the stochastic setting are a very interesting and natural research direction that can be explored in future work. Specifically, one could explore the case of finite-sum minimization where we only have access to a stochastic version of the local curvature matrix.
Knowledge of $f^*$
We acknowledge the reviewer's comment that requiring knowledge of the minimum value $f^*$ poses a potential limitation. We would like to clarify that in many modern machine learning settings, this requirement may not be as restrictive as it initially appears. There has been a growing body of work on stochastic Polyak step sizes for finite-sum minimization where the requirement is relaxed to knowing each individual $f_i^*$. This relaxation is particularly relevant for overparameterized models, which are now widespread. In such models, the training data can often be fit exactly, implying that each $f_i^* = 0$. As a result, future extensions of our work, particularly in the stochastic setting, can benefit from utilizing these techniques. Another well-known example where the value of the global minimum $f^*$ is known is binary classification with linearly separable data ($f^* = 0$).
Additionally, when the value of $f^*$ is not known, we can estimate it as $\hat f^t = f_{\text{best}}^t - \gamma_t$, where the sequence $\gamma_t > 0$ satisfies $\gamma_t \to 0$ and $\sum_t \gamma_t = \infty$, and $f_{\text{best}}^t$ refers to the smallest objective value in the trajectory so far [1]. Furthermore, several works have proposed strategies to employ the Polyak step size in the stochastic setting for finite-sum minimization without knowledge of $f_i^*$. A common approach is introducing slack variables, i.e., finding the smallest possible slack for which the Polyak-type inequality still holds [2] [3] [4]. Stochastic Polyak methods require knowledge of each $f_i^*$, which is a weaker condition than knowing the global $f^*$.
Questions
I would love to understand better how general this class is. Are there even more qualitatively different optimization problems that people care about in practice that satisfy your assumptions? I suspect this is a difficult question though. But anything that can be said here would improve the significance and impact of the paper.
We believe that exploring the sum of squares of absolutely convex functions, $f(x) = \sum_{i=1}^n \phi_i(x)^2$, has a lot of potential.
For example, if $\phi_i(x) = |\langle a_i, x\rangle - b_i|$ (which is absolutely convex), we obtain linear regression, and the curvature matrix coincides with the Hessian up to a constant, so, interestingly, our methods reduce to Newton's method. Instead, notice that if we select only a subset $S$ of the indices, the resulting matrix still satisfies Equation (5) in Assumption 2.1, since for all $x$,

$$\sum_{i \in S} \nabla\phi_i(x)\nabla\phi_i(x)^\top \;\preceq\; \sum_{i=1}^n \nabla\phi_i(x)\nabla\phi_i(x)^\top.$$

Although this produces a looser lower bound, it is still a valid curvature matrix and is cheaper to compute than the full curvature matrix, which can be the Hessian in special cases. This opens up possibilities to explore the empirical performance and convergence properties of methods that randomly select indices/data points to form local curvature matrices. Such a family of methods would be relevant to large-scale machine learning problems (where $n$ is large) and settings where forming or inverting the Hessian/full curvature matrix is not computationally feasible.
Another interesting result is that our class of functions leads to improved convergence rates for LCD1 and LCD2: the constant governing our rates can be much smaller than the classical smoothness constant $L$, which results in our methods having better convergence rates than their classical counterparts. To provide intuition with a simple example, consider an ill-conditioned objective built from the Huber loss: the function is $L$-smooth, yet it satisfies Assumption 2.1 with a non-trivial, piecewise-defined curvature matrix, and the gap between the two constants can be made arbitrarily large.
Concluding remarks
We thank the reviewer for their time and effort in reviewing our paper. We hope that our rebuttal here answers all the great points raised by the reviewer, allowing them to potentially upgrade their score if our responses merit it. We are also more than happy to answer any further questions that arise in the rebuttal period. Please let us know.
References
[1] Stephen Boyd. Lecture notes: Subgradient methods, Jan. 2014.
[2] R. M. Gower, A. Defazio, and M. Rabbat. Stochastic Polyak Stepsize with a Moving Target, Sept. 2021.
[3] R. M. Gower, M. Blondel, N. Gazagnadou, and F. Pedregosa. Cutting Some Slack for SGD with Adaptive Polyak Stepsizes, May 2022.
[4] S. Li, W. J. Swartworth, M. Takác, D. Needell, and R. M. Gower. SP2: A Second Order Stochastic Polyak Method, July 2022.
[5] Dong et al. "Towards Quantifying the Hessian Structure of Neural Networks" 2025.
[6] Malitsky and Mishchenko, "Adaptive Gradient Descent without Descent". 2019
Thank you for your response. It is helpful. I am already relatively positive on this paper and will retain my score of borderline acceptance.
Dear Reviewer NYA3,
Please read the author rebuttal and let us know whether your concerns have been satisfactorily addressed. If not, kindly point out what remains inadequate.
The (extended) deadline for the Author-Reviewer discussion is approaching. If you do not respond, I will need to apply the “insufficient review” flag in accordance with the NeurIPS 2025 policy.
You can flag non-participating reviewers with the InsufficientReview button, and we will use the flag in accordance with possible penalties of this year's responsible reviewing initiative and future reviewing invitations.
Best, AC
Dear Reviewer ch1b and Reviewer NYA3,
The Author-Reviewer discussion period has begun and will end on August 6. Please read the rebuttal and let us know whether it satisfactorily addresses your concerns. If not, could you specify what remains inadequate? Your response will help us evaluate the paper and assist the authors in improving their work.
Please avoid responding at the last minute, as the authors may not have sufficient time to clarify your concerns.
Thank you!
Best, AC
Consider the problem of minimizing a convex differentiable function over $\mathbb{R}^d$. This paper proposes a novel assumption, Assumption 2.1, which replaces the squared Euclidean norm terms in the standard smoothness and strong convexity assumptions with norms defined with respect to $\mathbf{C}(x) + L\mathbf{I}$ and $\mathbf{C}(x)$, respectively, for some $L > 0$. Here, $\mathbf{C}$ is a positive-definite matrix-valued mapping. The paper then analyzes two algorithms, LCD1 and LCD2, which are analogous to vanilla gradient descent and gradient descent with the Polyak step size, respectively. It also proposes the LCD3 algorithm, which lacks a convergence guarantee but performs well in numerical experiments.
The proposed assumption is novel, natural, and interesting. I believe it provides valuable insights into understanding adaptive step sizes in convex optimization. The significance of the assumption is supported by several examples in Sections 6 and 7, though they are arguably quite similar. The suggestion of utilizing the proposed assumption by adding a regularizer in Section 6 is reasonable, but such norm regularizers are not very popular. The algorithm LCD2 inherits the limitation of the Polyak step size in that it requires knowledge of $f^*$, as pointed out by Reviewer bXSH.
Both Reviewer GhWH and Reviewer bXSH are concerned with the comparison to quasi-Newton methods. The authors have shown in the rebuttal that their algorithms are competitive with L-BFGS.
Given the above, and the observation that three out of the four reviewers gave a score of 4, reflecting a conservative attitude toward accepting this paper, I suggest accepting this paper as a poster but would not mind if my recommendation were bumped down.