/10

Rejected5 位审稿人

最低1最高4标准差1.0

ICML 2025

Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation

Robert M. Gower,Guillaume Garrigos,Nicolas Loizou,Konstantin Mishchenko,Dimitris Oikonomou,Fabian Schaipp

OpenReview PDF

提交: 2025-01-09更新: 2025-06-18

TL;DR

How to use the knowledge of optimal values to design efficient optimization methods

摘要

We provide a general convergence theorem of an idealized stochastic Polyak step size called SPS*. Besides convexity, we only assume a local expected gradient bound, that includes locally smooth and locally Lipschitz losses as special cases. We refer to SPS* as idealized because it requires access to the loss for every training batch evaluated at a solution. It is also ideal, in that it achieves the optimal lower bound for globally Lipschitz function, and is the first Polyak step size to have a $\mathcal{O}(1/\sqrt{t})$ anytime convergence in the smooth setting. We show how to combine SPS* with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments to validate our theory, and a more practical setting showing how we can distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.

关键词

Polyak step sizemomentumoptimization theorymodel distillation

评审与讨论

审稿意见

评分: 32025-03-13

This paper analyzes the stochastic gradient method with Polyak stepsizes, termed SPS*, for minimizing the expectation of a family of convex functions $f_{\xi}$ . The method adaptively selects stepsizes based on knowledge of $f_{\xi}(x^*)$ , the function value at the optimal point. The authors demonstrate that SPS* is simultaneously adaptive in various settings, including nonsmooth, smooth, and strongly convex objectives. It achieves the same convergence rates as SGD with well-tuned stepsizes without requiring knowledge of problem-specific constants such as the distance to the solution or the Lipschitz constant. The paper also extends SPS* by incorporating momentum, achieving nearly the same convergence rates but valid for the last iterate instead of the average one. Computational experiments illustrate that for some problems, $f_{\xi}(x^*)$ can be well-approximated, enabling efficient implementation of SPS* with good performance.

给作者的问题

Please address the weaknesses noted above, particularly the incorrect convergence rates throughout the paper.
Why does Table 1 mark SGD with an "X" in the first column? SGD can also be constrained to the $D$ -ball if necessary.
What is the motivation for defining $\sigma_*^2 = \inf f - \mathbb{E}[\inf f_{\xi}]$ in Corollary 2.3? Why not define it using inequality (13)? The current definition excludes important cases, such as when $L \sigma_*^2 = \mathbb{E}[\\| g_{\xi}(x^\*) \\|^2]$ .
What is the purpose of introducing Bregman divergence terms in (20)?
What happens in Theorem G.1 when $B = 0$ ? This case is not currently covered.

论据与证据

The claims made in the paper are well-supported, with each theorem and lemma accompanied by a corresponding proof.

方法与评估标准

The proposed methods and evaluation criteria are appropriate for the problem at hand.

理论论述

I have reviewed the main theoretical claims and assessed the general idea behind each proof. However, I did not verify all details rigorously.

实验设计与分析

Since the paper is primarily theoretical, I did not conduct an in-depth review of the experimental designs.

补充材料

I reviewed the supplementary material, including the appendix, and inspected each proof.

与现有文献的关系

This work builds on the SPS* method, previously studied in (Gower et al., 2021; Garrigos et al., 2023). Some of the results may have been known in special cases (e.g., for Lipschitz functions or smooth functions under the interpolation assumption), but to my knowledge, they have not been established under the more general assumptions considered in this paper.

遗漏的重要参考文献

I have not identified any essential references missing from the discussion.

其他优缺点

Strengths:

The paper presents compelling arguments for the adaptivity of the stochastic Polyak method across key optimization settings.
The proofs are generally simple and easy to follow, allowing interested readers to verify the details.
The authors acknowledge the primary limitation of SPS*—the requirement for $f_{\xi}(x^*)$ —and discuss scenarios where this information is available.

Weaknesses:

Several mistakes appear in the reported convergence rates, particularly in Table 1. For example:
1. The convergence rate of SGD (and SPS*) for $L$ -smooth problems should be $\frac{L D^2}{t} + \sqrt{\frac{L D^2 \sigma_\*^2}{t}}$ , not $\frac{L D^2}{\sqrt{t}} + \frac{\sigma_\*^2 \log t}{L \sqrt{t}}$ . In particular, for $\sigma_* = 0$ , this should recover the standard rate $\frac{L D^2}{t}$ of gradient descent.
2. For $L$ -smooth strongly convex problems, the complexity to reach an $\epsilon$ -solution (in terms of function value) should be $\frac{L}{\mu} \log \frac{f(x_0) - f^*}{\epsilon} + \frac{\sigma_\*^2}{\mu \epsilon}$ , not $\frac{L^2 D^2}{\mu^2 t^2} + \frac{\sigma_\*^2}{\mu^2 t}$ . In particular, for $\sigma_* = 0$ , this should recover the standard estimate for gradient descent.
3. The reported rates for SPSmax and NGN appear incorrect, particularly as they are not invariant under function scaling. Also, $\sigma_{pos}$ is likely equivalent to $\sigma_*^2$ .
The novelty of this work compared to previous SPS* results (e.g., Gower et al., 2021; Garrigos et al., 2023) is not entirely clear. While the results seem more general, it would help to explicitly clarify their differences.
The impact of mini-batching on method performance is unclear. Clarification would be useful, especially given that for SGD, condition (9) is usually written in terms of variance rather than second moments, showing that both $A$ and $B$ decrease with increased mini-batch size.

其他意见或建议

There are inconsistent definitions of $\sigma_*$ throughout the paper (e.g., in Table 1, Corollary 2.3, and Theorem E.1).
Some rates in Table 1 are stated in terms of function suboptimality, while others use distance to the solution. A consistent presentation using function suboptimality would be preferable.
Line 142 is missing references to recent works on the Polyak stepsize for $(L_0, L_1)$ -smooth optimization:
- Gorbunov et al. Methods for Convex $(L_0, L_1)$ -Smooth Optimization: Clipping, Acceleration, and Adaptivity, 2024.
- Vankov et al. Optimizing $(L_0, L_1)$ -Smooth Functions by Gradient Methods, 2024.
Line 301: Likely meant to reference (13) instead of (14).
Line 308: The statement "Note that SGD ..." should be checked for accuracy.
Theorem E.1: The given estimate is incorrect. The correct bound should be $O(\frac{L D^2}{T} + \frac{\sigma_* D}{\sqrt{T}})$ .

作者回复

2025-03-28

Thank you for your detailed review. It is a pleasure to have an expert reviewer. Below we address each of your questions.

Q1 mistakes …in Table 1

The rates we report throughout are the anytime rates of each method. We do not report the fixed horizon rate. We will now include the following exert in our revision:

We say $x_t$ has an anytime convergence rate if, once parameters are fixed, there exists a real function $r:[0,+\infty) \to [0,+\infty)$ such that $\mathbb{E}[f(x_t) - f(x)] \leq r(t)$ and $r(t) \to 0$ as $t \to + \infty$ .

We say an algorithm has a finite horizon rate if, for every tolerance $\epsilon>0$ , there exists a choice of parameters and there exists $T \geq 1$ such that $\mathbb{E}[f(x_T) - f(x)] \leq \epsilon$ .

We focus on anytime rates because these are the rates that best translate into practical performance. Where-as fixed horizon rates depend on very small step sizes, and are completely impractical when implemented.

For example, SGD with a constant stepsize does not have an anytime convergence rate, but it has a finite horizon rate. The anytime rate for SGD on convex smooth problems is $O(\log(t)/\sqrt{t})$ (Theorem 5.7 in [Garrigos & Gower, 2023]) and the fixed horizon rate is $O(1/\sqrt{t})$ (Theorem 5.5 in [Garrigos & Gower, 2023]) which is the rate you pointed out.

Q2 Rates for SPSmax and NGN … not invariant under function scaling

SPSmax and NGN do depend on the function scaling. If we turn to eq (4) for SPSmax, it is not invariant because the constant $\gamma_b$ does not scale with the function. Where as the learning rate of NGN is given by $\gamma_k = \frac{\sigma f(x_k)}{f(x_k) + \sigma \| \nabla f(x_k)\|^2}$ The $\| \nabla f(x_k)\|^2$ scales quadratically, whereas $f(x_k)$ scales linearly with $f$ . Thus NGN cannot be invariant to scaling the function.

Q3 $\sigma_{pos}$ is likely equal to $\sigma_{*}^2$

These two constants are not equivalent, instead $\sigma_{*}^2 = f(x_{\*}) - \sigma_{pos}.$ In the setting where $\sigma_{pos} =0$ (regression models with no weight decay), we still could have $\sigma_*^2 \neq 0$

Q4 novelty compared to previous SPS* results

The novelty of our analysis of SPS* is in the smooth setting. Outside of the interpolation setting, we give the first convergence for SPS* in the smooth setting. Previous results in Gower et al., 2021 and Garrigos et al., 2023 only establish that SPS* converges to a fixed neighborhood of the solution.

Q5 impact of mini-batching

To see the dependency on mini-batching, we change the expectation in eq (9) to be over the mini-batching, instead of directly over $\xi$ . Then $A$ and $B$ will now depend on the batch size $b$ . For example, if we use sampling without replacement in the smooth $n$ --finite sum setting, then according to Proposition 3.8, item (iv) [Gower et al 2019] the constants $A$ and $B$ in eq (9) have the following form

$A = \frac{n}{b}\frac{b-1}{n-1} L_{full} + \frac{n-b}{b(n-1)} L_{\max}$ $B = \frac{1}{b}\frac{n-b}{n-1} \sigma^2_*$

Where $L_{full}$ and $L_{\max}$ are the full batch and max over stochastic functions smoothness constant, respectively.

For the non-smooth setting, the right hand side in (11) would be $G^2/b$ if we used $b$ --sampling with replacement.

Q6 inconsistent definitions of $\sigma_*$ throughout

Thank you and well spotted. We will no be consistent and use function noise $\sigma_f^2 =f(x_*) - \mathbb{E}{\inf f_{\xi}}$ throughout.

Q7 Table 1 uses function suboptimality and distance to the solution.

We will convert all results to function suboptimality, at the cost of an additional fixed $1/\mu$ multiplicative factor for the strongly convex problems.

Q8 Missing references

Thank you for the two additional references, we will include them both, and comments about their relative smoothness results.

Q9 Table 1 SGD has "X" in the first column? Constrain to the ball

We agree, but this would require access to the ball $B_D$ that contains $x_*$ , so that the iterates $x_k$ can be projected back onto $B_D$ . But this is a different oracle access. Depending on the problem, we would not have access to this ball. But we can include a footnote on this.

Q10 Why not define $\sigma_*$ in (13)

We use Corollary 2.3 to give a precise setting, and Theorem 2.1 and eq (9) for the general setting. But we do agree that Corollary 2.3 also holds when using gradient noise given by $\sigma_* = L \mathbb{E}\|g_{\xi}(x_*)\|^2$ .

Q11 Why introduce Bregman div in (20)

We introduce the Bregman div because it is a positive term that provides a minor speed-up of IAM over SPS*.

Q12 What if $B=0$ in Theorem G.1

Good question, the result in Theorem G.1 does not hold for $B=0.$ Instead when $B=0$ , from equation (49) we can prove that SPS* converges linearly according to $\mathbb{E}\\| x_{t+1}-x_*\\|^2 \leq (1-\mu/(2A))\mathbb{E}\\|x_{t}-x_*\\|^2$ We will edit the theorem to include this remark.

审稿意见

评分: 42025-03-13

The paper provides an analysis in different convex settings of the stochastic Polyak step size SPS* and a version with momentum/iterate averaging they call Iterate Averaging Adaptive Method (IAM). The paper analyzes convergence rates in both non-smooth and smooth settings, and shows that both SPS* and IAM are able to adapt to the interpolation noise level $\sigma_\ast^2$ to interpolate between $1/\sqrt{T}$ and $1/T$ rates. They also discuss the limitations imposed by needing access to $f_\xi(x_\ast)$ , the value of the optimum on each batch, and presents a model distillation setup in which this is possible to estimate using the teacher network.

给作者的问题

n/a

论据与证据

Yes. The focus of this paper is on proving convergence rates in for SPS* and IAM for convex objectives and all proofs are presented in the appendix.

方法与评估标准

The paper evaluated SPS* and IAM (and the Adam variants) on a model distillation task, which makes sense as a way to circumvent the main issue with these methods, which is that they require knowledge of $f_\xi(x_\ast)$ . While the experiments are fairly limited/small scale, they serve as a nice proof of concept.

理论论述

Yes. I checked the proofs in Appendices C, D.1-D.4 and skimmed the remaining proofs in the appendix.

实验设计与分析

n/a

补充材料

See "Theoretical Claims"

与现有文献的关系

This paper continues a line of work extending the Polyak step size to the stochastic setting. They provide clean and simple proofs for convergence for SPS* and extend the same arguments to IAM. These methods also fall into the category of "parameter free" methods, although they require access to $f_\xi(x_\ast)$ which may be difficult to estimate.

遗漏的重要参考文献

n/a

其他优缺点

Strengths:

Both the main paper and the appendix are well written and have clear comparisons to prior work
The proof sketch for SPS* (Appendix D.1) is very clear and makes it easy to read the full proof in D.2.
The paper is forthcoming about its weaknesses (e.g. the need to estimate $f_\xi(x_\ast)$ ) and discusses situations in which this is possible.

Weaknesses:

It is not clear that the theory is relevant for the experiments in Figure 1 as the setting is highly non-convex. However, this is a common issue faced by papers in convex optimization theory and is not the focus of the paper.

其他意见或建议

Lines 307-308: There should be a square root for the SGD convergence rate to match eq. (42)

作者回复

2025-03-28

Dear Reviewer,

thank you for your thoughtful review, and doing the expert work of checking our proofs and appendix, and catching the type-O on Lines 307-308. Regarding the weakness you highlighted

It is not clear that the theory is relevant for the experiments in Figure 1 as the setting is highly non-convex. However, this is a common issue faced by papers in convex optimization theory and is not the focus of the paper.

We do of course agree. But we have written a small paragraph “Defense of convex optimization in machine learning” in response to reviewer UUoW rejection of our work because of the convexity assumption. We copy the relevant parts of our response here for your consideration, though it does not truly address your very valid point.

Defense of convex optimization in machine learning.

Stochastic non-smooth convex optimization has a surprising impact on deep learning practice.

For instance, the development of the Adagrad method [Duchi2011] was based on its convergence analysis in the non-smooth convex setting. Adagrad went on to be widely used in Deep learning, until the development of exponentiated weighted variants of Adagrad such as RMSprop and ADAM. Thus non-smooth convex analysis has a significant role to play in inspiring the most popular optimization ADAM for deep learning. Much more recently, the Schedule-free method [Defazio2024] which was developed to have fast last-iterate convergence in the non-smooth convex setting, set a new record in the AlgoPerf competition in the self-tuning category:

https://mlcommons.org/2024/08/mlc-algoperf-benchmark-competition/

The fact that non-smooth convex analysis is a reasonable model for predicting performance on Deep learning, including training large language models, is now a well-established conundrum [Schaipp2025]. Where-as, the convergence analysis of SGD in the non-convex setting has not had such an impact on Deep learning practice. As such, convex optimization has a surprisingly important and active role to play in deep learning practice.

[Duchi2011] Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." Journal of Machine Learning Research 12 (7): 2121–2159, 2011

[Defazio2024] Aaron Defazio, Xingyu Alice Yang, Ahmed Khaled, Konstantin Mishchenko, Harsh Mehta, Ashok Cutkosky, “The Road Less Scheduled”. Neurips Oral 2024

[Schaipp2024] Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli, Francis Bach, “The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training”, arXiv:2501.18965, 2025

审稿意见

评分: 12025-03-17

This paper presents some optimization methods with idealized stochastic Polyak step (SPS) sizes and their convergence analyses under the convexity conditions of objective functions. The analyses are based on the following two methods:

[Section 2] Stochastic gradient descent (SGD) with SPS (Theorem 2.1) under

Expected locally Lipschitz assumption (Corollary 2.2)
local expected smoothness assumption (Corollary 2.3)

[Section 3] Iterate averaging adaptive method (IAM) with SPS under

Expected locally Lipschitz assumption (Theorem 3.2)
local expected smoothness assumption (Theorem 3.3)

It also provides some numerical results to support their analyses (Section 4).

update after rebuttal

Thank you for your comments. I checked all of your replies.

I have discussed my concern with AC and other reviewers during AC and Reviewers discussion period. As a result, I am not satisfied with your recent response.

I believe that the assumption such that $f_i$ is convex on a neighborhood at the global minimizer $x_*$ of $f = (1/n) \sum_{i=1}^n f_i$ is strong, since I do not know practical examples of $f$ and $f_i$ satisfying the assumption and we have many counterexamples. I asked the authors for providing examples of the assumptions in Sections 2 and 3 and the authors said that "As for the Black-box Model Distillation in Section 4.1, the student models are decoder only transformers, as such, they are non-convex deep neural networks." Unfortunately, I am not satisfied with the authors' response, and hence, at this time, I conclude that the assumption such that $f_i$ is convex on a neighborhood at the global minimizer $x_*$ and the assumptions in Sections 2 and 3 are unrealistic.

It would be grateful if the authors could provide practical examples such that $f_i$ is convex on a neighborhood at the global minimizer $x_*$ of $f = (1/n) \sum_{i=1}^n f_i$ . At least, I have no example of such $f_i$ s. Hence, I believe the revised manuscript should be submitted to other venues. My opinion is that the convexity condition over the whole space is unrealistic, while the convexity condition over a neighborhood at a local minimizer is acceptable.

给作者的问题

Theoretical Concerns: For simplicity, let us consider a finite sum minimization (Remark 2.4). I understand a consideration under the convexity condition is significant for ML community. However, I believe that the assumption in Theorem 2.1 such that $f\_\xi$ is convex over the whole space $\mathbb{R}^d$ is strong for ML problems. For example, since the cross-entropy loss used in Section 4 (Page 8) is non-convex over the whole space, the assumption is strong. Let $x^*$ be a local minimizer of $f_i$ (or $f$ ) . We must replace the assumption with (A1) `` $f\_i$ is convex over a $\delta_i$ -neighborhood at $x^*$ (or $f$ is convex over a $\delta$ -neighborhood at $x^*$ ), while $f$ and $f\_i$ are non-convex over the whole space $\mathbb{R}^d$ ." The $\delta$ -neighborhood at $x^*$ is denoted by $N(x^*; \delta)$ . Here, I have the following concerns under (A1):

Does $x_t$ $(t = 1,2,\cdots)$ generated by SGD (2) belong to $N(x^*; \delta)$ ? That is, is it guaranteed that the points generated by SGD are in a neighborhood at a local minimizer of $f$ with probability 1? If $x_t \notin N(x^*; \delta)$ , then, since $f_i$ (or $f$ ) is not convex around $x_t$ , the results in Theorem 2.1 (in particular, (8) and (10)) do not hold. Therefore, we must prove that, for all $t$ , $x_t \in N(x^*; \delta)$ with probability 1.
We assume that $f$ is continuously differentiable. Even if $x_t \in N(x^*; \delta)$ , we do not know whether $x_{t+1} = x_t - \gamma_t g_t$ is in $N(x^*; \delta)$ , since $g_t$ is a stochastic gradient of $f$ (i.e., there exists an upper bound of the variance $\sigma^2$ such that $\mathbb{E}\_\xi [\Vert g_t - \nabla f (x_t)\Vert^2] \leq \sigma^2$ ).

The above concerns are applied to the results of IAM (Section 3).

Practical Concerns: Do the objective functions considered in Section 4 satisfy the assumptions in Sections 2 and 3 (for example, in Theorem 2.1, $f\_\xi$ is convex over $\mathbb{R}^d$ and (9))?

From the above discussion, my recommendation is Reject. However, I can change my score if the authors could address the above theoretical and practical concerns.

论据与证据

I am an emergency reviewer of the paper ICML Submission 261. Hence, I could not check all the proofs of the theorems. However, I have theoretical and practical concerns for Claims And Evidence. Please see Questions For Authors.

方法与评估标准

Numerical comparisons in Section 4 would be insufficient to support the performances of the proposed methods. More numerical comparisons with large-scale datasets and networks would be needed.

理论论述

I have some theoretical concerns. Please see Questions For Authors.

实验设计与分析

I have some practical concerns. Please see Questions For Authors.

补充材料

N/A

与现有文献的关系

The paper compares the proposed analyses with the existing ones (Table 1). However, more comparisons and discussions for the existing analyses are needed. For example, it is needed to discuss not only convex setting but also non-convexity setting, and it is needed to compare SGD, momentum, and adaptive methods using mini-batch stochastic (sub)gradients with the proposed methods.

遗漏的重要参考文献

I think that the references are sufficient.

其他优缺点

I have theoretical and practical concerns. Please see Questions For Authors.

其他意见或建议

I am an emergency reviewer. I am sorry to find only the following typos:

Abstract: 022: a O --> an O
(3): ``support" is not defined.
Theorem 2.1: I guess that $\mathbb{E}\_\xi$ is an expectation with respect to $\xi$ . Meanwhile, $\mathbb{E}$ in (10) is not defined.

作者回复

2025-03-27

Dear reviewer, the main issue you have raised is regarding our assumption in our theorems that the loss is convex. We will divide our answer into two parts. In our first part, we will show how to relax global convexity to a form of local convexity. Indeed, this extension follows straightforwardly by using that SPS* and IAM have the very unusual property of being monotonic (Eq (8) and Lemma 3.1). In our second part, we write a defense for convex optimization in machine learning. After which we answer a minor point below.

We hope the reviewer will reconsider their position in light of our response.

Extending beyond convexity.
Our theorems do not require global convexity. Instead, we need only assume that for the given starting iterate $x_0 \in \mathbb{R}^d$ , that the $f_\xi$ functions are almost surely convex within the closed ball

B_D(x^*) = \lbrace x : \Vert x - x^* \Vert \leq \Vert x_0 - x^* \Vert \rbrace

Using your notation, we need convexity in an $\delta$ -neighborhood around $x^*$ where $\delta = \Vert x_0 - x^* \Vert .$

The reason we are able to relax the global convexity assumption is because SPS* and IAM have the very unusual property (amongst stochastic methods) of being monotonic. That is, for the iterates $x_t$ (or $z_t$ for IAM), we have that

\Vert x_{t+1} - x_* \Vert \leq \Vert x_t - x_* \Vert \leq \cdots \leq \Vert x_0 - x_* \Vert.

The same cannot be said about SGD or any variant of SGD that we are aware of, since the iterates may escape the closed ball with some probability. We will include a remark on this in our revision.

Defense of convex optimization in machine learning.
As a mere formality, we first point out that convex optimization is one of the official topics of interest listed in ICML’s 2025 call for papers:
https://icml.cc/Conferences/2025/CallForPapers

As such, we do not believe it is a valid reason to recommend rejecting our paper because our analysis focuses on stochastic convex optimization. Moreover, stochastic non-smooth convex optimization has had, and continues to have a surprising impact on deep learning practice.

For instance:

Adagrad [1] was developed based on its favourable non-smooth convex anytime convergence. Adagrad went on to be widely used in Deep learning, until the development of exponentiated weighted variants of Adagrad such as RMSprop and ADAM. Thus non-smooth convex analysis has a significant role to play in inspiring the most popular optimization method (ADAM) for deep learning
The Schedule-free method [2], designed based on its fast last iterate convergence for non-smooth convex settings, just recently set a record and one first place in the AlgoPerf competition in the self-tuning category for deep learning (https://mlcommons.org/2024/08/mlc-algoperf-benchmark-competition/).

The fact that non-smooth convex analysis has some predictive power on the performance in deep learning (including LLMs) is now a well-established conundrum [3]. In contrast, non-convex SGD analysis has had far less practical impact on deep learning. Thus, convex optimization remains relevant to deep learning practice, justifying its inclusion in ICML’s topics of interest.

Q1. Do the objective functions considered in Section 4 satisfy the assumptions?

A1. The Poisson regression experiments mentioned in Section 4 (and relegated to Appendix L.1 due to space constraints), is a globally convex problem. In fact, this is one problem for which SPS* and IAM provenly converge (see Remark 2.4), and for which we are unaware of any other method that converges. This experiment was to highlight the predictive power of our theory. As for the Black-box Model Distillation in Section 4.1, the student models are decoder only transformers, as such, they are non-convex deep neural networks. Thus our theorems do not explicitly hold for these problems.

References

[1]: Duchi, J., Hazan, E., & Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159, 2011

[2]: Aaron Defazio, Xingyu Alice Yang, Ahmed Khaled, Konstantin Mishchenko, Harsh Mehta, Ashok Cutkosky. The Road Less Scheduled, Oral at NeurIPS 2024.

[3]: Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli, Francis Bach. The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training, arXiv:2501.18965, 2025

审稿人评论

2025-04-07

Thank you for your comments. However, I am not satisfied with your comments.

Extending beyond convexity: I do not know why SPS* and IAM have the very unusual property (amongst stochastic methods) of being monotonic. Could you prove it without using the convexity condition over the whole space? (this is my question. Please check the above comments.)

Defense of convex optimization in machine learning: I understand a consideration under the convexity condition is significant for ML community. (Please check the above comments.)

A1: Thank you for your reply. However, your comment implies that the convexity condition is strong in practice.

Therefore, my score remains unchanged.

作者评论

2025-04-07

Dear Reviewer,

thank you for engaging further, giving us a chance to address your concerns. We address all of your questions/concerns below.

Q2.1 Extending beyond convexity and the descent lemma.

Yes, we can prove the descent lemma for SPS* and IAM using only local convexity. The argument is a simple induction. In words, our induction proof will establish that, if we assume $f_{\xi}$ is almost surely convex in $B_D(x_0) := \lbrace x : \Vert x - x_* \Vert \leq \Vert x_0 - x_* \Vert \rbrace,$ then all the iterates $x_t$ are in $B_D(x_0)$ , and consequently we have a descent lemma at every step.

Now for the proof: Suppose that $x_0$ is such that $f_0 := f_{\xi_0}$ is almost surely convex at $x_0,$ in other words for $t=0$ we assume

$f_t(x) \geq f_t(x_t) + \langle g_t, x- x_t \rangle,$ where $g_t$ is a stochastic subgradient.

Using the above and (6) in our paper, and for $t=0$ , we then have that:

$\\| x_{t+1} - x_* \\|^2 - \\| x_{t} - x_* \\|^2 \leq 2 \gamma_t (f_{t}(x_t) - f_t(x_*)) + \gamma_t \\| g_t\\|^2$

Minimizing the right hand side in $\gamma_t\geq 0$ gives $\gamma_t = \frac{(f_{t}(x_t) - f_t(x_*))_+}{\\| g_t\\|^2}$

Plugging this back into the above gives $\\| x_{t+1} - x_* \\|^2 - \\| x_{t} - x_* \\|^2 \leq -\frac{(f_{t}(x_t) - f_t(x_*))_+^2}{\\| g_t\\|^2}$

This shows that we have descent at step $t=0$ and consequently $x_1$ must be closer to the solution $x_*$ as compared to $x_0$ , that is $x_1 \in B_D(x_0).$

Now our induction argument is that if $x_t$ is in $B_D(x_0)$ , then $x_{t+1}$ is in $B_D(x_0)$ . The proof of the induction step follows the exact same steps above but for any $t$ (instead of just $t=0$ ). Because of this, we need only assume $f_{\xi}$ is convex in $B_D(x_0)$ and not globally.

Q2.2 your comment implies that the convexity condition is strong in practice

The fact that Poisson regression is globally convex, does not show global convexity is a strong assumption. Poisson regression is one of many globally convex machine learning problems that enjoys countless applications. In terms of theory, assuming only global convexity is a very weak assumption. More assumptions are always included (Lipschitz or Smooth) in order to achieve a convergence rate.

To understand further how global convexity is not a strong assumption, consider the fact that Poisson regression is being applied to numerous tasks in Operations and Queueing Systems (counting arrivals, customer call, requests per second), in Insurance and Actuarial Science (counting number of claims per policyholder, accidents ..etc), in Neuroscience ( Modeling spike counts, counting event-related potentials), and the list goes. There are entire book dedicated to the applications of Poisson regression, such as

Cameron AC, Trivedi PK. Regression Analysis of Count Data. 2nd ed. Cambridge University Press; 2013.

Thus, despite Poisson regression being globally convex, it gives us access to all these applications.

Furthermore, there are countless training problems in machine learning that are globally convex such as

Linear Regression
Support Vector Machines
Logistic Regression
Boosting (for instance Adaboost with an exponential loss)
Conditional Random Fields
Linear Quadratic Regulator (LQR) (in Optimal control and Reinforcement learning)
Kernel Ridge Regression

The list goes on, and is also the subject of several text books such as

Changho Suh, Convex Optimization for Machine Learning, Now Publishers (2023)

Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press (2014)

In summary, (globally) convex problems in machine learning is an entire field. Though it excludes deep learning, it still enjoys many daily and commonly used applications. The fact that our method is both stochastic and only relies on convexity within $B_D(x_0)$ is exceptional within the field.

We hope the review will reconsider their position in light of our response.

审稿意见

评分: 32025-03-17

This paper studies an idealized method of stochastic Polyak step size called SPS*, assuming access to the optimal loss value that can be achieved by a solution. It is shown that SPS* achieves the optimal convergence bound for globally Lipschitz function, and also enjoys anytime convergence with rate $O(1/\sqrt{t})$ for smooth functions. It is also shown that momentum can be incorporated to guarantee the convergence rate of the last iterate. Acknowledging the impracticality of having access to the optimal loss value in general, the paper also identifies settings where it is possible to approximate the optimal loss value, including 1) the interpolation setting where the optimal loss value should be 0, and 2) the setting where the optimal loss value can be approximated, e.g., model distisllation where the optimal loss value can be approximated by the teacher model.

给作者的问题

Why does the approximation of the optimal loss value have to be an underestimate?

论据与证据

Yes, the claims made in the submission are supported by clear and convincing evidence.

方法与评估标准

Yes, the proposed methods and evaluation criteria make sense for the problem at hand.

理论论述

I did not check the correctness of the proofs in the appendix.

实验设计与分析

The experimental designs and analyses discussed in the main text seem sound.

补充材料

I took a quick look at the supplementary details for the experiments in Appendix L. The appendix provides additional details on the experimental setup and results, which are helpful for understanding the empirical evaluation.

与现有文献的关系

The key contributions of the paper are about the convergence properties of the method of Polyak step size, and the incorporation of momentum to guarantee the convergence rate of the last iterate. This is related to the broader literature about learning rate schedules and momentum methods in optimization.

遗漏的重要参考文献

The essential references have been properly discussed in the paper.

其他优缺点

The paper is well written and presents a clear comparison between the convergence rates of SPS* and other existing methods. Though SPS* is an idealized method, the authors provide a clear motivation for studying this method and show that it can be useful in practice in certain settings, where the example of model distillation is particularly interesting.

It would be helpful if further discussions are provided on how to approximate the optimal loss value in general settings, and how the convergence analysis would be affected by the approximation error.

其他意见或建议

N/A

作者回复

2025-03-27

We thank the reviewer for their thoughtful review, and appreciating our work.

Q1. It would be helpful if further discussions are provided on how to approximate the optimal loss value in general settings, and how the convergence analysis would be affected by the approximation error.

A1. This is a good question. We are currently working on a follow-up on how the optimal loss value could be approximated on language models. The optimal cross-entropy loss over the training data is lower bounded by the average number of bits used to compress this data when using an optimal compression method. The final loss over a batch is then given by the average number of bits used to compress this batch. Though we do not have access to an optimal compressor, compressing is in general an easier task than predicting the next token, and thus, we do have access to very good compressors. This would suggest the following approximation: Compress the training data using a good compressor $\implies$ Use the average number of bits used to encode a batch of data as an approximation to the optimal batch loss $\implies$ Run IAM with this approximation. We can consider adding this to potential future work, if the reviewer thinks it is insightful or partially answers their question. We are also considering how to extend our convergence analysis to allow for an approximation error. This has proved challenging, and we cannot guarantee that our revision will include results on this.

审稿意见

评分: 32025-03-17

This paper presents a new anytime convergence result for stochastic polyak step-size which requires access to the loss for every training batch evaluated at a solution. They also provide a momentum variant of this method and prove convergence rates with respect to the final iterate. Lastly, they provide a small set of experiments which compare an approximation of their momentum optimization algorithm to other standard optimizers for toy experiments and a distillation problem.

给作者的问题

Questions I had while reading the manuscript

Can you explain this statement more formally: "By anytime, we mean a proof that the method converges to any predefined tolerance without prior knowledge of that tolerance."
"The proof of convergence for SPS* in the Lipschitz convex and strongly convex setting was first given in (...)" - the main contribution is then your convergence result (for the same algorithm) is any-time vs not anytime.
" ... but they either require bounded domain (Levy et al., 2018; Kavis et al., 2019) or are only studied in the deterministic setting (Li & Lan, 2023)." Can you clarify what the difference between this assumption and what you assume in line 270 is?
The teacher student setting does not really explain why that would be any different from a general stochastic setting. Why would we expect a better approximation to zeta there over other settings. Is it just that the surrogate (from the teacher) should be close to the minimum provided by the original problem? If so this seems very related to the online imitation learning setting.
You state that "assuming access to fξ(x∗) is not the same as assuming that interpolation holds", but again I am a bit unclear on the relationship between this assumption and the general stochastic / deterministic assumption.

论据与证据

SPS* Convergence Rates

Claim:

SPS* achieves optimal/best-known convergence rates under a new set of assumptions not considered by existing optimization literature

Evidence:

Formal proofs in Theorem 2.1 and Corollaries 2.2, 2.3
Shows O(1/√T) rate for Lipschitz functions which matching known lower bounds
First O(1/√T) anytime convergence rate for smooth convex case

IAM Last-Iterate Convergence

Claim:

IAM achieves same favorable rates as SPS* but for last iterate

Evidence:

Theoretical proofs in Theorems 3.2 and 3.3
Avoids need for iterate averaging which is less practical
Matches optimal rates in non-smooth case

Adaptivity to Interpolation

Claim:

Methods automatically adapt to interpolation settings

Evidence:

Theoretical analysis showing rates improve when interpolation parameter σ* approaches 0
Proofs show O(1/T) convergence under interpolation vs O(1/√T) otherwise

Convergence Without Smoothness/Lipschitz Assumptions

Claim:

Methods converge on non-smooth, non-Lipschitz problems

Evidence:

Experiments on Poisson regression showing convergence comparable to L-BFGS
Performance matches SGD with best manually-tuned learning rate

Robustness to Misspecification

Claim:

Methods are somewhat robust to errors in optimal loss estimates

Evidence:

Experiments showing convergence (though slower) when using approximate/averaged values
Detailed ablation studies with different levels of noise

Effectiveness for Model Distillation

Claim:

Methods enable efficient distillation without hyperparameter tuning

Evidence:

Experiments distilling large GPT-2 models on three datasets
Competitive performance vs carefully tuned SGD/Adam baselines
Learning rate plots showing sensible adaptive behavior

方法与评估标准

Empirical

The methods and evaluation criteria are appropriate for the stated goals, with evaluation chosen to:

Demonstrate practical utility in distillation
Test robustness to key assumptions

Weaknesses

Could include more diverse set of optimization problems
Relatively small number of datasets
Limited comparison to other adaptive methods (especially Adagrad / SLS variants)
One SLS variant, SSO, seems designed for applications related to the distillation problem (online imitation learning) and may be a good variant to consider evaluation agains.
No evaluation or discussion of computational overhead
Some confusion (for me) around why distillation satisfies their assumption (outside of non-convexity issues)

理论论述

I did not have time to work through the entire appendix, however the results which they present seem reasonable given the assumptions that they make regarding access to the true function evaluation for every batch. I would like some additional discussion around exactly how this assumption relates to interpolation settings and deterministic settings. It just kinda seems like the assumption of knowing the true f_zeta seems to reduce the problem to a deterministic one, which from my understanding becomes effectively equivalent to an interpolation setting.

实验设计与分析

Poisson Regression Experiments:

Summary

Comparison methodology against L-BFGS and SGD
Learning rate sweep for baselines (0.001 × {0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 20, 50})
Dataset characteristics

Sound aspects:

Appropriate baseline methods
Reasonable learning rate range
Clear reporting of epochs/convergence

Issues:

No mention of number of repeated runs/statistical significance

Misspecification Study:

Summary:

Experimental setup with synthetic quadratic problems
Three versions of IAM tested (theoretical, averaged, lower-bound)

Sound aspects:

Well-controlled synthetic setup
Clear parameter settings
Systematic variation of noise (ν) and batch size (b)

Issues:

No error bars/variance reporting
Single problem structure (quadratic) may limit generalizability

Distillation Experiments:

Summary:

Dataset splits and sizes
Model architectures
Hyperparameter tuning procedure
Baseline comparison methodology

Sound aspects:

Multiple datasets
Clear documentation of model sizes
Systematic hyperparameter search for baselines

Issues:

No reporting of variance across runs
No discussion of computational costs
Training durations not clearly specified

Limitations in Experimental Design:

Lack of statistical significance testing
Missing error bars/variance reporting
Incomplete documentation of computational requirements
Limited ablation studies on algorithmic components

While the experimental designs are generally reasonable, the lack of statistical analysis and variance reporting makes it difficult to fully assess the robustness of the results (I did not catch a mention of seeds or multiple runs).

补充材料

I was only able to briefly browse the material in the appendix, and largely focused on the experimental results.

与现有文献的关系

I cannot speak to the theoretical impacts to the larger scientific community but the experimental results seem at best incremental, and are currently missing quite a few comparisons.

遗漏的重要参考文献

From my understanding of this area the authors seem to reference all of the folks which I am aware which work on these methods. There could however be some papers which are newer that I am not aware of.

其他优缺点

Strengths:

Theoretical Depth & Rigor:

Comprehensive convergence analysis
Clear proofs and technical innovations
Handles multiple function classes
Adapts automatically to interpolation

Practical Relevance:

Demonstrates real application in model distillation
Shows how to incorporate momentum and Adam-style preconditioning
Seems to avoid hyperparameter tuning in practice

Organization & Clarity:

Clear connection between theory and practice
Well-structured presentation
Thorough appendices with detailed proofs

Technical Innovation:

Novel analysis techniques (e.g., explicit inversion of convex monotone function)
First O(1/√T) anytime rate for smooth case

Validation Strategy:

Tests both theoretical claims and practical utility
Uses multiple datasets and model sizes
Includes ablation on misspecified optimal losses

Weaknesses:

Experimental Rigor:

No reporting of random seeds
Missing error bars/statistical significance
Limited ablation studies
No analysis of computational overhead

Limited Comparisons:

Insufficient theoretical comparison to AdaGrad
Limited empirical comparison to other adaptive methods
Relatively small number of datasets

Practical Limitations:

Core method (SPS*) requires access to optimal batch losses
Theoretical results assume convexity
Primary applications limited to special cases (interpolation/distillation)

Missing Analysis:

No discussion of computational complexity
Limited analysis of robustness to hyper-parameters
No investigation of failure modes

Empirical Gaps:

Limited diversity in optimization problems and algorithms tested
No large-scale benchmarking

其他意见或建议

Most of my issues are on the experimental side, so if you could add additional baselines to your comparisons I would consider increasing my score, subject the other reviewers substantiating the relevance of the theoretical claims.

作者回复

2025-03-27

We thank the reviewer for their comprehensive review. We have divided our answers in Main Questions, and Questions

MQ1 anytime convergence formally?

We say an algorithm has a finite horizon convergence if, for every tolerance $\epsilon>0$ , there exists a choice of parameters and there exists $T \geq 1$ such that $\mathbb{E}[f(x_T) - f(x)] \leq \epsilon$ .

If an algorithm has an anytime convergence rate then it has a finite time convergence rate, but the reciprocal is not true in general. The counterexample is SGD with constant stepsize, which does not have an anytime convergence rate, but has a finite time convergence.

MQ2 main contribution is (SPS*) anytime rate?

Our main contributions are the anytime rates for SPS* in the smooth (convex+ str convex) setting, proposing the IAM method, all the last iterate analysis of IAM, and identifying a reasonable application (black box model distillation).

MQ3 Difference between bounded domain and line 270

The bounded domain assumes we know a bounded set $\mathcal{B}\subset \mathbb{R}^d$ such that $x_* \in \mathcal{B}$ . If we know such a set, we can modify SGD to project $x_t$ onto $\mathcal{B}$ .

Our assumption on line 270 and Eq (9), is very different and is almost for free. We require that the gradients verify some inequality on a certain ball, that we do not need to know. As we emphasize in Remark 2.4, for finite sum problems our assumption is always true, because convex functions always have bounded gradients on every bounded set.

MQ4 The teacher should be close to the minimum? Related to online imitation learning?

Yes, the teacher's loss is assumed to be close to the minimum student loss, and thus we can use the teacher loss to set the learning rate of the student using the IAM. There is some relation to the online imitation learning setting, both involve a student learning from a teacher, but there are differences. In online imitation learning the student learns a policy, and has to interact with an environment. Where-as in our setting there is no environment or policy.

MQ5 Unclear "access to $f_{\xi}(x^*)$ not the same as interpolation"

This is a good question. In short, assuming access to $f_{\xi}(x^*)$ imposes no restriction on the problem itself. This could be any stochastic optimization problem. Rather, this is an assumption as to what we know regarding the problem, and what we have access to.

The interpolation assumption does impose a constraint on what type of problems we are considering. Interpolation restricts our problem to models that can perfectly fit every single data point. This can (approximately) occur in image classification tasks, where the neural network is sufficiently overparameterized with respect to the data set. This cannot occur when training a generative language model, because it is impossible to perfectly guess the next token in a sequence due to ambiguities in languages. Because of this, the cross-entropy loss is never zero, and is always above the entropy of language (around 2.0). Thus the interpolation assumption cannot hold in generative language modelling. In contrast, assuming access to $f_{\xi}(x^*)$ imposes no restriction to the model itself, and it can occur in setting such a generative language modelling. For instance, such as out black-box model distillation problem

Q1 Comparisons to Adagrad

In the non-smooth setting, Adagrad has a guaranteed convergence of $\sqrt{2}GD/\sqrt{T}$ , which is $\sqrt 2$ worse than our rates in for SPS* and IAM in Eq (12) and (20). The difference between this rate of Adagrad and our rates for IAM, is this rate holds for the last iterate of IAM and Adagrad assumes access to a bounded domain $D$ which contains the solution. We can include numerical comparisons to Adagrad, but AdamW is SOTA for LLMs, not Adagrad, which perform poorly for LLMs.

Q2 Comparisons SLS variant, SSO

Neither SLS, or the SSO variant, are applicable in our black-box model distillation setting. SLS assumes that we can use the teacher to generate a candidate sequence, where-as SSO assumes we have access to the logits of the teacher. But our black-box setting assumes we only have access to the final loss of the teacher, and not the logits.

Q2 Discussion on computational overhead

Certainly, SPS* and IAM impose a small additional computational overhead as compared to SGD and SGD-M. Both SPS* and IAM need only store one additional scalar, and do two inner products with respect to the gradient. We will add this remark to the appendix.

Q3 Random seeds

We use $42$ to set all random seeds.

Q4 No failure modes

We investigate failure modes regarding misspecification of the optimal loss values in Section L2.

最终决定Reject

2025-05-01

This paper analyzes the stochastic gradient method with Polyak stepsizes, termed SPS*. With access to the optimal loss value, the method adaptively selects stepsizes and is proven to achieve the optimal convergence rate for globally Lipschitz functions—without requiring knowledge of problem-specific constants such as the distance to the solution or the Lipschitz constant. The paper further extends SPS* by incorporating momentum, achieving nearly the same convergence rate but applicable to the last iterate instead of the average. Computational experiments demonstrate that the optimal loss value can be effectively approximated, enabling efficient implementation of SPS* with empirical performance.

However, one reviewer raised significant concerns regarding the assumption of local convexity. Specifically, the assumption that each component function $f_i$ is convex in a neighborhood of the global minimizer $x_*$ of $f = \frac{1}{n} \sum_{i=1}^n f_i$ is quite strong. The authors do not provide any practical examples to support this assumption, which undermines the applicability of the theoretical results in nonconvex settings. As such, the theoretical contributions may be seen as overly optimistic or unrealistic under practical scenarios. Given these concerns, the current version of the paper does not appear ready for publication. We encourage the authors to address the reviewers’ feedback and revise the manuscript accordingly.