PaperHub
Score: 5.0/10 · Rejected · 5 reviewers
Ratings: 3, 4, 1, 3, 3 (min 1, max 4, std 1.0)
ICML 2025

Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation

OpenReview | PDF
Submitted: 2025-01-09 · Updated: 2025-06-18
TL;DR

How to use the knowledge of optimal values to design efficient optimization methods

Abstract

We provide a general convergence theorem of an idealized stochastic Polyak step size called SPS*. Besides convexity, we only assume a local expected gradient bound, that includes locally smooth and locally Lipschitz losses as special cases. We refer to SPS* as idealized because it requires access to the loss for every training batch evaluated at a solution. It is also ideal, in that it achieves the optimal lower bound for globally Lipschitz functions, and is the first Polyak step size to have an $\mathcal{O}(1/\sqrt{t})$ anytime convergence in the smooth setting. We show how to combine SPS* with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments to validate our theory, and a more practical setting showing how we can distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.
Keywords
Polyak step size · momentum · optimization theory · model distillation

Reviews and Discussion

Review (Rating: 3)

This paper analyzes the stochastic gradient method with Polyak stepsizes, termed SPS*, for minimizing the expectation of a family of convex functions $f_{\xi}$. The method adaptively selects stepsizes based on knowledge of $f_{\xi}(x^*)$, the function value at the optimal point. The authors demonstrate that SPS* is simultaneously adaptive in various settings, including nonsmooth, smooth, and strongly convex objectives. It achieves the same convergence rates as SGD with well-tuned stepsizes without requiring knowledge of problem-specific constants such as the distance to the solution or the Lipschitz constant. The paper also extends SPS* by incorporating momentum, achieving nearly the same convergence rates but valid for the last iterate instead of the average one. Computational experiments illustrate that for some problems $f_{\xi}(x^*)$ can be well approximated, enabling an efficient implementation of SPS* with good performance.

Questions for Authors

  1. Please address the weaknesses noted above, particularly the incorrect convergence rates throughout the paper.

  2. Why does Table 1 mark SGD with an "X" in the first column? SGD can also be constrained to the $D$-ball if necessary.

  3. What is the motivation for defining $\sigma_*^2 = \inf f - \mathbb{E}[\inf f_{\xi}]$ in Corollary 2.3? Why not define it using inequality (13)? The current definition excludes important cases, such as when $L \sigma_*^2 = \mathbb{E}[\| g_{\xi}(x^*) \|^2]$.

  4. What is the purpose of introducing Bregman divergence terms in (20)?

  5. What happens in Theorem G.1 when $B = 0$? This case is not currently covered.

Claims and Evidence

The claims made in the paper are well-supported, with each theorem and lemma accompanied by a corresponding proof.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are appropriate for the problem at hand.

Theoretical Claims

I have reviewed the main theoretical claims and assessed the general idea behind each proof. However, I did not verify all details rigorously.

Experimental Design and Analyses

Since the paper is primarily theoretical, I did not conduct an in-depth review of the experimental designs.

Supplementary Material

I reviewed the supplementary material, including the appendix, and inspected each proof.

Relation to Prior Work

This work builds on the SPS* method, previously studied in (Gower et al., 2021; Garrigos et al., 2023). Some of the results may have been known in special cases (e.g., for Lipschitz functions or smooth functions under the interpolation assumption), but to my knowledge, they have not been established under the more general assumptions considered in this paper.

Essential References Not Discussed

I have not identified any essential references missing from the discussion.

Other Strengths and Weaknesses

Strengths:

  1. The paper presents compelling arguments for the adaptivity of the stochastic Polyak method across key optimization settings.

  2. The proofs are generally simple and easy to follow, allowing interested readers to verify the details.

  3. The authors acknowledge the primary limitation of SPS*, namely the requirement for $f_{\xi}(x^*)$, and discuss scenarios where this information is available.

Weaknesses:

  1. Several mistakes appear in the reported convergence rates, particularly in Table 1. For example:

    1. The convergence rate of SGD (and SPS*) for $L$-smooth problems should be $\frac{L D^2}{t} + \sqrt{\frac{L D^2 \sigma_*^2}{t}}$, not $\frac{L D^2}{\sqrt{t}} + \frac{\sigma_*^2 \log t}{L \sqrt{t}}$. In particular, for $\sigma_* = 0$, this should recover the standard rate $\frac{L D^2}{t}$ of gradient descent.

    2. For $L$-smooth strongly convex problems, the complexity to reach an $\epsilon$-solution (in terms of function value) should be $\frac{L}{\mu} \log \frac{f(x_0) - f^*}{\epsilon} + \frac{\sigma_*^2}{\mu \epsilon}$, not $\frac{L^2 D^2}{\mu^2 t^2} + \frac{\sigma_*^2}{\mu^2 t}$. In particular, for $\sigma_* = 0$, this should recover the standard estimate for gradient descent.

    3. The reported rates for SPSmax and NGN appear incorrect, particularly as they are not invariant under function scaling. Also, $\sigma_{pos}$ is likely equivalent to $\sigma_*^2$.

  2. The novelty of this work compared to previous SPS* results (e.g., Gower et al., 2021; Garrigos et al., 2023) is not entirely clear. While the results seem more general, it would help to explicitly clarify their differences.

  3. The impact of mini-batching on method performance is unclear. Clarification would be useful, especially given that for SGD, condition (9) is usually written in terms of variance rather than second moments, showing that both $A$ and $B$ decrease with increased mini-batch size.

Other Comments or Suggestions

  1. There are inconsistent definitions of $\sigma_*$ throughout the paper (e.g., in Table 1, Corollary 2.3, and Theorem E.1).

  2. Some rates in Table 1 are stated in terms of function suboptimality, while others use distance to the solution. A consistent presentation using function suboptimality would be preferable.

  3. Line 142 is missing references to recent works on the Polyak stepsize for $(L_0, L_1)$-smooth optimization:

    • Gorbunov et al. Methods for Convex $(L_0, L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity, 2024.
    • Vankov et al. Optimizing $(L_0, L_1)$-Smooth Functions by Gradient Methods, 2024.
  4. Line 301: Likely meant to reference (13) instead of (14).

  5. Line 308: The statement "Note that SGD ..." should be checked for accuracy.

  6. Theorem E.1: The given estimate is incorrect. The correct bound should be $O(\frac{L D^2}{T} + \frac{\sigma_* D}{\sqrt{T}})$.

Author Response

Thank you for your detailed review. It is a pleasure to have an expert reviewer. Below we address each of your questions.

Q1 mistakes …in Table 1

The rates we report throughout are the anytime rates of each method. We do not report the fixed-horizon rates. We will include the following excerpt in our revision:

We say $x_t$ has an anytime convergence rate if, once parameters are fixed, there exists a real function $r:[0,+\infty) \to [0,+\infty)$ such that $\mathbb{E}[f(x_t) - f(x_*)] \leq r(t)$ and $r(t) \to 0$ as $t \to +\infty$.

We say an algorithm has a finite-horizon rate if, for every tolerance $\epsilon>0$, there exists a choice of parameters and there exists $T \geq 1$ such that $\mathbb{E}[f(x_T) - f(x_*)] \leq \epsilon$.

We focus on anytime rates because these are the rates that best translate into practical performance, whereas fixed-horizon rates depend on very small step sizes and are impractical when implemented.

For example, SGD with a constant stepsize does not have an anytime convergence rate, but it has a finite-horizon rate. The anytime rate for SGD on convex smooth problems is $O(\log(t)/\sqrt{t})$ (Theorem 5.7 in [Garrigos & Gower, 2023]) and the fixed-horizon rate is $O(1/\sqrt{t})$ (Theorem 5.5 in [Garrigos & Gower, 2023]), which is the rate you pointed out.
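To make the distinction concrete, here is a small illustrative simulation (a toy quadratic with synthetic gradient noise; none of it is taken from the paper): a constant stepsize stalls at a noise floor, while a decaying stepsize keeps improving, which is the anytime behaviour described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration: SGD on f(x) = 0.5 * ||x||^2 with additive gradient noise.
def run_sgd(stepsize_fn, T=20000, d=10, noise=1.0):
    x = np.ones(d)
    for t in range(1, T + 1):
        g = x + noise * rng.normal(size=d)   # stochastic gradient of f at x
        x = x - stepsize_fn(t) * g
    return np.linalg.norm(x)                 # final distance to the solution x_* = 0

const_step = run_sgd(lambda t: 0.1)               # constant step: plateaus at a noise floor
decay_step = run_sgd(lambda t: 0.1 / np.sqrt(t))  # 1/sqrt(t) step: keeps decreasing

print("constant step, final distance :", const_step)
print("1/sqrt(t) step, final distance:", decay_step)
```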

Q2 Rates for SPSmax and NGN … not invariant under function scaling

SPSmax and NGN do depend on the function scaling. If we turn to eq (4) for SPSmax, it is not invariant because the constant $\gamma_b$ does not scale with the function. The learning rate of NGN is given by $\gamma_k = \frac{\sigma f(x_k)}{f(x_k) + \sigma \| \nabla f(x_k)\|^2}$. The term $\| \nabla f(x_k)\|^2$ scales quadratically, whereas $f(x_k)$ scales linearly with $f$. Thus NGN cannot be invariant to scaling the function.
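A minimal numeric check of the scaling argument above, using the NGN-style step-size expression quoted in this reply (the 1-D quadratic and the value $\sigma = 1$ are illustrative placeholders, not taken from the paper):

```python
# Sketch: the quoted step size changes when the loss is rescaled, f -> c*f.
def ngn_step(f_val, grad_norm_sq, sigma=1.0):
    # gamma_k = sigma * f(x_k) / (f(x_k) + sigma * ||grad f(x_k)||^2)
    return sigma * f_val / (f_val + sigma * grad_norm_sq)

# Illustrative 1-D quadratic f(x) = 0.5 * x^2 evaluated at x = 2.0.
x = 2.0
f_val, grad = 0.5 * x**2, x

for c in (1.0, 10.0):  # rescale the loss by c
    gamma = ngn_step(c * f_val, (c * grad) ** 2)
    print(f"c = {c:4.1f}  ->  step size {gamma:.4f},  new iterate {x - gamma * c * grad:.4f}")
# The step sizes (and the resulting iterates) differ across c, so the update is
# not invariant under rescaling of the loss.
```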

Q3 $\sigma_{pos}$ is likely equal to $\sigma_{*}^2$

These two constants are not equivalent; instead, $\sigma_{*}^2 = f(x_{*}) - \sigma_{pos}$. In the setting where $\sigma_{pos} = 0$ (regression models with no weight decay), we could still have $\sigma_*^2 \neq 0$.

Q4 novelty compared to previous SPS* results

The novelty of our analysis of SPS* is in the smooth setting. Outside of the interpolation setting, we give the first convergence guarantee for SPS* in the smooth setting. Previous results in Gower et al., 2021 and Garrigos et al., 2023 only establish that SPS* converges to a fixed neighborhood of the solution.

Q5 impact of mini-batching

To see the dependency on mini-batching, we change the expectation in eq (9) to be over the mini-batches, instead of directly over $\xi$. Then $A$ and $B$ will depend on the batch size $b$. For example, if we use sampling without replacement in the smooth $n$-finite-sum setting, then according to Proposition 3.8, item (iv) in [Gower et al 2019], the constants $A$ and $B$ in eq (9) have the following form:

$$A = \frac{n}{b}\frac{b-1}{n-1} L_{full} + \frac{n-b}{b(n-1)} L_{\max}, \qquad B = \frac{1}{b}\frac{n-b}{n-1} \sigma^2_*$$

where $L_{full}$ is the smoothness constant of the full-batch function and $L_{\max}$ is the maximum of the smoothness constants of the stochastic functions.

For the non-smooth setting, the right-hand side in (11) would be $G^2/b$ if we used $b$-sampling with replacement.
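A small sketch of these constants as a function of the batch size, transcribing the two formulas above (the numerical values of $n$, $L_{full}$, $L_{\max}$, and $\sigma_*^2$ are placeholders chosen only to show the trend):

```python
# Sketch: mini-batch constants A and B from the formulas above
# (sampling without replacement in the smooth n-finite-sum setting).
def minibatch_constants(b, n, L_full, L_max, sigma_star_sq):
    A = (n / b) * ((b - 1) / (n - 1)) * L_full + (n - b) / (b * (n - 1)) * L_max
    B = (1 / b) * ((n - b) / (n - 1)) * sigma_star_sq
    return A, B

# Placeholder problem constants.
n, L_full, L_max, sigma_star_sq = 1000, 1.0, 10.0, 4.0
for b in (1, 10, 100, 1000):
    A, B = minibatch_constants(b, n, L_full, L_max, sigma_star_sq)
    print(f"b = {b:4d}:  A = {A:6.3f},  B = {B:6.3f}")
# A decreases from L_max (b = 1) towards L_full (b = n), and B decreases to 0,
# matching the remark that both constants shrink with larger mini-batches.
```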

Q6 inconsistent definitions of $\sigma_*$ throughout

Thank you, well spotted. We will now be consistent and use the function noise $\sigma_f^2 = f(x_*) - \mathbb{E}[\inf f_{\xi}]$ throughout.

Q7 Table 1 uses function suboptimality and distance to the solution.

We will convert all results to function suboptimality, at the cost of an additional fixed $1/\mu$ multiplicative factor for the strongly convex problems.

Q8 Missing references

Thank you for the two additional references, we will include them both, and comments about their relative smoothness results.

Q9 Table 1 SGD has "X" in the first column? Constrain to the ball

We agree, but this would require access to the ball $B_D$ that contains $x_*$, so that the iterates $x_k$ can be projected back onto $B_D$. However, this is a different kind of oracle access, and depending on the problem, we would not have access to this ball. We can include a footnote on this.
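For concreteness, a sketch of the projection step such a constrained SGD variant would need, assuming the ball's center c and radius D are known (which is exactly the extra oracle access discussed above; all names here are illustrative):

```python
import numpy as np

# Sketch: one projected-SGD step onto a known ball B = {x : ||x - c|| <= D}.
def project_onto_ball(x, c, D):
    r = np.linalg.norm(x - c)
    return x if r <= D else c + (D / r) * (x - c)

def projected_sgd_step(x, grad, lr, c, D):
    return project_onto_ball(x - lr * grad, c, D)
```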

Q10 Why not define $\sigma_*$ in (13)

We use Corollary 2.3 to give a precise setting, and Theorem 2.1 and eq (9) for the general setting. But we do agree that Corollary 2.3 also holds when using the gradient noise given by $\sigma_* = L \mathbb{E}\|g_{\xi}(x_*)\|^2$.

Q11 Why introduce Bregman div in (20)

We introduce the Bregman divergence because it is a positive term that provides a minor speed-up of IAM over SPS*.

Q12 What if $B=0$ in Theorem G.1

Good question; the result in Theorem G.1 does not hold for $B=0$. Instead, when $B=0$, from equation (49) we can prove that SPS* converges linearly according to $\mathbb{E}\| x_{t+1}-x_*\|^2 \leq (1-\mu/(2A))\mathbb{E}\|x_{t}-x_*\|^2$. We will edit the theorem to include this remark.

Review (Rating: 4)

The paper provides an analysis, in different convex settings, of the stochastic Polyak step size SPS* and a version with momentum/iterate averaging they call the Iterate Averaging Adaptive Method (IAM). The paper analyzes convergence rates in both non-smooth and smooth settings, and shows that both SPS* and IAM are able to adapt to the interpolation noise level $\sigma_*^2$ to interpolate between $1/\sqrt{T}$ and $1/T$ rates. It also discusses the limitations imposed by needing access to $f_\xi(x_*)$, the value of the optimum on each batch, and presents a model distillation setup in which this value can be estimated using the teacher network.

Questions for Authors

n/a

Claims and Evidence

Yes. The focus of this paper is on proving convergence rates for SPS* and IAM for convex objectives, and all proofs are presented in the appendix.

Methods and Evaluation Criteria

The paper evaluated SPS* and IAM (and the Adam variants) on a model distillation task, which makes sense as a way to circumvent the main issue with these methods, which is that they require knowledge of $f_\xi(x_*)$. While the experiments are fairly limited/small scale, they serve as a nice proof of concept.

Theoretical Claims

Yes. I checked the proofs in Appendices C, D.1-D.4 and skimmed the remaining proofs in the appendix.

Experimental Design and Analyses

n/a

Supplementary Material

See "Theoretical Claims"

Relation to Prior Work

This paper continues a line of work extending the Polyak step size to the stochastic setting. They provide clean and simple proofs of convergence for SPS* and extend the same arguments to IAM. These methods also fall into the category of "parameter free" methods, although they require access to $f_\xi(x_*)$, which may be difficult to estimate.

Essential References Not Discussed

n/a

Other Strengths and Weaknesses

Strengths:

  • Both the main paper and the appendix are well written and have clear comparisons to prior work
  • The proof sketch for SPS* (Appendix D.1) is very clear and makes it easy to read the full proof in D.2.
  • The paper is forthcoming about its weaknesses (e.g. the need to estimate $f_\xi(x_*)$) and discusses situations in which this is possible.

Weaknesses:

  • It is not clear that the theory is relevant for the experiments in Figure 1 as the setting is highly non-convex. However, this is a common issue faced by papers in convex optimization theory and is not the focus of the paper.

Other Comments or Suggestions

Lines 307-308: There should be a square root for the SGD convergence rate to match eq. (42)

Author Response

Dear Reviewer,

thank you for your thoughtful review, for doing the expert work of checking our proofs and appendix, and for catching the typo on Lines 307-308. Regarding the weakness you highlighted:

It is not clear that the theory is relevant for the experiments in Figure 1 as the setting is highly non-convex. However, this is a common issue faced by papers in convex optimization theory and is not the focus of the paper.

We do of course agree. But we have written a small paragraph "Defense of convex optimization in machine learning" in response to reviewer UUoW's rejection of our work on the grounds of the convexity assumption. We copy the relevant parts of our response here for your consideration, though it does not truly address your very valid point.

Defense of convex optimization in machine learning.

Stochastic non-smooth convex optimization has a surprising impact on deep learning practice.

For instance, the development of the Adagrad method [Duchi2011] was based on its convergence analysis in the non-smooth convex setting. Adagrad went on to be widely used in deep learning, until the development of exponentially weighted variants of Adagrad such as RMSprop and ADAM. Thus non-smooth convex analysis had a significant role in inspiring the most popular optimization method for deep learning, ADAM. Much more recently, the Schedule-Free method [Defazio2024], which was developed to have fast last-iterate convergence in the non-smooth convex setting, set a new record in the AlgoPerf competition in the self-tuning category:

https://mlcommons.org/2024/08/mlc-algoperf-benchmark-competition/

The fact that non-smooth convex analysis is a reasonable model for predicting performance in deep learning, including training large language models, is now a well-established conundrum [Schaipp2025]. In contrast, the convergence analysis of SGD in the non-convex setting has not had such an impact on deep learning practice. As such, convex optimization has a surprisingly important and active role to play in deep learning practice.

[Duchi2011] Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." Journal of Machine Learning Research 12 (7): 2121–2159, 2011

[Defazio2024] Aaron Defazio, Xingyu Alice Yang, Ahmed Khaled, Konstantin Mishchenko, Harsh Mehta, Ashok Cutkosky, “The Road Less Scheduled”. Neurips Oral 2024

[Schaipp2025] Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli, Francis Bach, "The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training", arXiv:2501.18965, 2025

Review (Rating: 1)

This paper presents some optimization methods with idealized stochastic Polyak step (SPS) sizes and their convergence analyses under the convexity conditions of objective functions. The analyses are based on the following two methods:

[Section 2] Stochastic gradient descent (SGD) with SPS (Theorem 2.1) under

  • Expected locally Lipschitz assumption (Corollary 2.2)
  • local expected smoothness assumption (Corollary 2.3)

[Section 3] Iterate averaging adaptive method (IAM) with SPS under

  • Expected locally Lipschitz assumption (Theorem 3.2)
  • local expected smoothness assumption (Theorem 3.3)

It also provides some numerical results to support their analyses (Section 4).

update after rebuttal

Thank you for your comments. I checked all of your replies.

I have discussed my concern with AC and other reviewers during AC and Reviewers discussion period. As a result, I am not satisfied with your recent response.

I believe that the assumption that $f_i$ is convex on a neighborhood of the global minimizer $x_*$ of $f = (1/n) \sum_{i=1}^n f_i$ is strong, since I do not know practical examples of $f$ and $f_i$ satisfying the assumption and we have many counterexamples. I asked the authors to provide examples of the assumptions in Sections 2 and 3, and the authors said that "As for the Black-box Model Distillation in Section 4.1, the student models are decoder only transformers, as such, they are non-convex deep neural networks." Unfortunately, I am not satisfied with the authors' response, and hence, at this time, I conclude that the assumption that $f_i$ is convex on a neighborhood of the global minimizer $x_*$ and the assumptions in Sections 2 and 3 are unrealistic.

I would be grateful if the authors could provide practical examples in which $f_i$ is convex on a neighborhood of the global minimizer $x_*$ of $f = (1/n) \sum_{i=1}^n f_i$. At least, I have no example of such $f_i$'s. Hence, I believe the revised manuscript should be submitted to other venues. My opinion is that the convexity condition over the whole space is unrealistic, while the convexity condition over a neighborhood of a local minimizer is acceptable.

Questions for Authors

Theoretical Concerns: For simplicity, let us consider finite-sum minimization (Remark 2.4). I understand that an analysis under the convexity condition is significant for the ML community. However, I believe that the assumption in Theorem 2.1 that $f_\xi$ is convex over the whole space $\mathbb{R}^d$ is strong for ML problems. For example, since the cross-entropy loss used in Section 4 (Page 8) is non-convex over the whole space, the assumption is strong. Let $x^*$ be a local minimizer of $f_i$ (or $f$). We must replace the assumption with (A1): "$f_i$ is convex over a $\delta_i$-neighborhood of $x^*$ (or $f$ is convex over a $\delta$-neighborhood of $x^*$), while $f$ and $f_i$ are non-convex over the whole space $\mathbb{R}^d$." The $\delta$-neighborhood of $x^*$ is denoted by $N(x^*; \delta)$. Here, I have the following concerns under (A1):

  • Does $x_t$ ($t = 1,2,\cdots$) generated by SGD (2) belong to $N(x^*; \delta)$? That is, is it guaranteed that the points generated by SGD are in a neighborhood of a local minimizer of $f$ with probability 1? If $x_t \notin N(x^*; \delta)$, then, since $f_i$ (or $f$) is not convex around $x_t$, the results in Theorem 2.1 (in particular, (8) and (10)) do not hold. Therefore, we must prove that, for all $t$, $x_t \in N(x^*; \delta)$ with probability 1.
  • We assume that $f$ is continuously differentiable. Even if $x_t \in N(x^*; \delta)$, we do not know whether $x_{t+1} = x_t - \gamma_t g_t$ is in $N(x^*; \delta)$, since $g_t$ is a stochastic gradient of $f$ (i.e., there exists an upper bound $\sigma^2$ on the variance such that $\mathbb{E}_\xi [\Vert g_t - \nabla f (x_t)\Vert^2] \leq \sigma^2$).

The above concerns also apply to the results for IAM (Section 3).

Practical Concerns: Do the objective functions considered in Section 4 satisfy the assumptions in Sections 2 and 3 (for example, in Theorem 2.1, that $f_\xi$ is convex over $\mathbb{R}^d$ and that (9) holds)?

From the above discussion, my recommendation is Reject. However, I can change my score if the authors could address the above theoretical and practical concerns.

Claims and Evidence

I am an emergency reviewer of the paper ICML Submission 261. Hence, I could not check all the proofs of the theorems. However, I have theoretical and practical concerns for Claims And Evidence. Please see Questions For Authors.

Methods and Evaluation Criteria

Numerical comparisons in Section 4 would be insufficient to support the performances of the proposed methods. More numerical comparisons with large-scale datasets and networks would be needed.

Theoretical Claims

I have some theoretical concerns. Please see Questions For Authors.

Experimental Design and Analyses

I have some practical concerns. Please see Questions For Authors.

Supplementary Material

N/A

Relation to Prior Work

The paper compares the proposed analyses with the existing ones (Table 1). However, more comparisons and discussions with existing analyses are needed. For example, the paper should discuss not only the convex setting but also the non-convex setting, and should compare the proposed methods with SGD, momentum, and adaptive methods that use mini-batch stochastic (sub)gradients.

Essential References Not Discussed

I think that the references are sufficient.

Other Strengths and Weaknesses

I have theoretical and practical concerns. Please see Questions For Authors.

Other Comments or Suggestions

I am an emergency reviewer. I am sorry that I could only find the following typos:

  • Abstract, line 022: "a $\mathcal{O}$" --> "an $\mathcal{O}$"
  • (3): "support" is not defined.
  • Theorem 2.1: I guess that $\mathbb{E}_\xi$ is the expectation with respect to $\xi$. Meanwhile, $\mathbb{E}$ in (10) is not defined.
Author Response

Dear reviewer, the main issue you have raised is regarding the assumption in our theorems that the loss is convex. We will divide our answer into two parts. In the first part, we show how to relax global convexity to a form of local convexity. Indeed, this extension follows straightforwardly from the fact that SPS* and IAM have the very unusual property of being monotonic (Eq (8) and Lemma 3.1). In the second part, we write a defense of convex optimization in machine learning. After that, we answer a minor point below.

We hope the reviewer will reconsider their position in light of our response.

Extending beyond convexity.
Our theorems do not require global convexity. Instead, we need only assume that, for the given starting iterate $x_0 \in \mathbb{R}^d$, the functions $f_\xi$ are almost surely convex within the closed ball

$$B_D(x^*) = \lbrace x : \Vert x - x^* \Vert \leq \Vert x_0 - x^* \Vert \rbrace$$

Using your notation, we need convexity in a $\delta$-neighborhood around $x^*$, where $\delta = \Vert x_0 - x^* \Vert$.

The reason we are able to relax the global convexity assumption is because SPS* and IAM have the very unusual property (amongst stochastic methods) of being monotonic. That is, for the iterates $x_t$ (or $z_t$ for IAM), we have that

$$\Vert x_{t+1} - x_* \Vert \leq \Vert x_t - x_* \Vert \leq \cdots \leq \Vert x_0 - x_* \Vert.$$

The same cannot be said about SGD or any variant of SGD that we are aware of, since the iterates may escape the closed ball with some probability. We will include a remark on this in our revision.


Defense of convex optimization in machine learning.
As a mere formality, we first point out that convex optimization is one of the official topics of interest listed in ICML’s 2025 call for papers:
https://icml.cc/Conferences/2025/CallForPapers

As such, we do not believe that the fact that our analysis focuses on stochastic convex optimization is a valid reason to recommend rejecting our paper. Moreover, stochastic non-smooth convex optimization has had, and continues to have, a surprising impact on deep learning practice.

For instance:

  • Adagrad [1] was developed based on its favourable non-smooth convex anytime convergence. Adagrad went on to be widely used in deep learning, until the development of exponentially weighted variants of Adagrad such as RMSprop and ADAM. Thus non-smooth convex analysis has had a significant role in inspiring the most popular optimization method (ADAM) for deep learning.

  • The Schedule-Free method [2], designed based on its fast last-iterate convergence for non-smooth convex settings, just recently set a record and won first place in the AlgoPerf competition in the self-tuning category for deep learning (https://mlcommons.org/2024/08/mlc-algoperf-benchmark-competition/).

The fact that non-smooth convex analysis has some predictive power on the performance in deep learning (including LLMs) is now a well-established conundrum [3]. In contrast, non-convex SGD analysis has had far less practical impact on deep learning. Thus, convex optimization remains relevant to deep learning practice, justifying its inclusion in ICML’s topics of interest.

Q1. Do the objective functions considered in Section 4 satisfy the assumptions?

A1. The Poisson regression experiment mentioned in Section 4 (and relegated to Appendix L.1 due to space constraints) is a globally convex problem. In fact, this is one problem for which SPS* and IAM provably converge (see Remark 2.4), and for which we are unaware of any other method that converges. This experiment was meant to highlight the predictive power of our theory. As for the Black-box Model Distillation in Section 4.1, the student models are decoder-only transformers, and as such, they are non-convex deep neural networks. Thus our theorems do not explicitly hold for these problems.

References

[1]: Duchi, J., Hazan, E., & Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159, 2011

[2]: Aaron Defazio, Xingyu Alice Yang, Ahmed Khaled, Konstantin Mishchenko, Harsh Mehta, Ashok Cutkosky. The Road Less Scheduled, Oral at NeurIPS 2024.

[3]: Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli, Francis Bach. The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training, arXiv:2501.18965, 2025

Reviewer Comment

Thank you for your comments. However, I am not satisfied with your comments.

Extending beyond convexity: I do not know why SPS* and IAM have the very unusual property (amongst stochastic methods) of being monotonic. Could you prove it without using the convexity condition over the whole space? (this is my question. Please check the above comments.)

Defense of convex optimization in machine learning: I understand a consideration under the convexity condition is significant for ML community. (Please check the above comments.)

A1: Thank you for your reply. However, your comment implies that the convexity condition is strong in practice.

Therefore, my score remains unchanged.

Author Comment

Dear Reviewer,

thank you for engaging further, giving us a chance to address your concerns. We address all of your questions/concerns below.

Q2.1 Extending beyond convexity and the descent lemma.

Yes, we can prove the descent lemma for SPS* and IAM using only local convexity. The argument is a simple induction. In words, our induction proof will establish that, if we assume $f_{\xi}$ is almost surely convex in $B_D(x_0) := \lbrace x : \Vert x - x_* \Vert \leq \Vert x_0 - x_* \Vert \rbrace$, then all the iterates $x_t$ are in $B_D(x_0)$, and consequently we have a descent lemma at every step.

Now for the proof: Suppose that $x_0$ is such that $f_0 := f_{\xi_0}$ is almost surely convex at $x_0$; in other words, for $t=0$ we assume

$$f_t(x) \geq f_t(x_t) + \langle g_t, x - x_t \rangle,$$ where $g_t$ is a stochastic subgradient.

Using the above and (6) in our paper, and for $t=0$, we then have that:

$$\| x_{t+1} - x_* \|^2 - \| x_{t} - x_* \|^2 \leq -2 \gamma_t \left(f_{t}(x_t) - f_t(x_*)\right) + \gamma_t^2 \| g_t\|^2$$

Minimizing the right-hand side over $\gamma_t \geq 0$ gives $$\gamma_t = \frac{(f_{t}(x_t) - f_t(x_*))_+}{\| g_t\|^2}.$$

Plugging this back into the above gives $$\| x_{t+1} - x_* \|^2 - \| x_{t} - x_* \|^2 \leq -\frac{(f_{t}(x_t) - f_t(x_*))_+^2}{\| g_t\|^2}.$$

This shows that we have descent at step $t=0$, and consequently $x_1$ must be closer to the solution $x_*$ than $x_0$, that is, $x_1 \in B_D(x_0)$.

Now our induction argument is that if $x_t$ is in $B_D(x_0)$, then $x_{t+1}$ is in $B_D(x_0)$. The proof of the induction step follows the exact same steps above, but for any $t$ (instead of just $t=0$). Because of this, we need only assume $f_{\xi}$ is convex in $B_D(x_0)$ and not globally.
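To make the argument concrete, here is a minimal sketch of the SPS*-style update used in the derivation above, on a toy convex least-squares problem of our own; it only illustrates the monotone decrease of the distance to the reference point and is not taken from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy convex problem: f_i(x) = 0.5 * (a_i^T x - b_i)^2, with x_star playing the
# role of the solution x_* whose batch losses f_i(x_star) the step may query.
n, d = 50, 5
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star + 0.1 * rng.normal(size=n)   # noise, so f_i(x_star) > 0 in general

def f_i(i, x):
    return 0.5 * (A[i] @ x - b[i]) ** 2

def grad_i(i, x):
    return (A[i] @ x - b[i]) * A[i]

x = rng.normal(size=d)   # x_0
for t in range(200):
    i = rng.integers(n)
    g = grad_i(i, x)
    # Step from the derivation above: gamma_t = (f_i(x_t) - f_i(x_star))_+ / ||g_t||^2.
    gamma = max(f_i(i, x) - f_i(i, x_star), 0.0) / (g @ g + 1e-12)
    x_new = x - gamma * g
    # Monotonicity from the induction argument: the distance never increases.
    assert np.linalg.norm(x_new - x_star) <= np.linalg.norm(x - x_star) + 1e-9
    x = x_new

print("final distance to x_star:", np.linalg.norm(x - x_star))
```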

Q2.2 your comment implies that the convexity condition is strong in practice

The fact that Poisson regression is globally convex does not show that global convexity is a strong assumption. Poisson regression is one of many globally convex machine learning problems that enjoy countless applications. In terms of theory, assuming only global convexity is a very weak assumption; more assumptions (Lipschitz continuity or smoothness) are always included in order to achieve a convergence rate.

To understand further how global convexity is not a strong assumption, consider the fact that Poisson regression is applied to numerous tasks in operations and queueing systems (counting arrivals, customer calls, requests per second), in insurance and actuarial science (counting the number of claims per policyholder, accidents, etc.), in neuroscience (modeling spike counts, counting event-related potentials), and the list goes on. There are entire books dedicated to the applications of Poisson regression, such as

Cameron AC, Trivedi PK. Regression Analysis of Count Data. 2nd ed. Cambridge University Press; 2013.

Thus, despite Poisson regression being globally convex, it gives us access to all these applications.

Furthermore, there are countless training problems in machine learning that are globally convex such as

  • Linear Regression
  • Support Vector Machines
  • Logistic Regression
  • Boosting (for instance Adaboost with an exponential loss)
  • Conditional Random Fields
  • Linear Quadratic Regulator (LQR) (in Optimal control and Reinforcement learning)
  • Kernel Ridge Regression

The list goes on, and is also the subject of several textbooks, such as

Changho Suh, Convex Optimization for Machine Learning, Now Publishers (2023)

Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press (2014)

In summary, (globally) convex machine learning problems form an entire field. Though it excludes deep learning, this field still enjoys many daily and commonly used applications. The fact that our method is both stochastic and relies only on convexity within $B_D(x_0)$ is exceptional within the field.

We hope the reviewer will reconsider their position in light of our response.

Review (Rating: 3)

This paper studies an idealized method of stochastic Polyak step size called SPS*, assuming access to the optimal loss value that can be achieved by a solution. It is shown that SPS* achieves the optimal convergence bound for globally Lipschitz functions, and also enjoys anytime convergence with rate $O(1/\sqrt{t})$ for smooth functions. It is also shown that momentum can be incorporated to guarantee the convergence rate of the last iterate. Acknowledging the impracticality of having access to the optimal loss value in general, the paper also identifies settings where it is possible to approximate the optimal loss value, including 1) the interpolation setting, where the optimal loss value should be 0, and 2) settings where the optimal loss value can be approximated, e.g., model distillation, where the optimal loss value can be approximated by the teacher model.

Questions for Authors

Why does the approximation of the optimal loss value have to be an underestimate?

Claims and Evidence

Yes, the claims made in the submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

Yes, the proposed methods and evaluation criteria make sense for the problem at hand.

Theoretical Claims

I did not check the correctness of the proofs in the appendix.

Experimental Design and Analyses

The experimental designs and analyses discussed in the main text seem sound.

Supplementary Material

I took a quick look at the supplementary details for the experiments in Appendix L. The appendix provides additional details on the experimental setup and results, which are helpful for understanding the empirical evaluation.

Relation to Prior Work

The key contributions of the paper are about the convergence properties of the method of Polyak step size, and the incorporation of momentum to guarantee the convergence rate of the last iterate. This is related to the broader literature about learning rate schedules and momentum methods in optimization.

Essential References Not Discussed

The essential references have been properly discussed in the paper.

Other Strengths and Weaknesses

The paper is well written and presents a clear comparison between the convergence rates of SPS* and other existing methods. Though SPS* is an idealized method, the authors provide a clear motivation for studying this method and show that it can be useful in practice in certain settings, where the example of model distillation is particularly interesting.

It would be helpful if further discussions are provided on how to approximate the optimal loss value in general settings, and how the convergence analysis would be affected by the approximation error.

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for their thoughtful review, and appreciating our work.

Q1. It would be helpful if further discussions are provided on how to approximate the optimal loss value in general settings, and how the convergence analysis would be affected by the approximation error.

A1. This is a good question. We are currently working on a follow-up on how the optimal loss value could be approximated for language models. The optimal cross-entropy loss over the training data is lower bounded by the average number of bits used to compress this data when using an optimal compression method. The final loss over a batch is then given by the average number of bits used to compress this batch. Though we do not have access to an optimal compressor, compressing is in general an easier task than predicting the next token, and thus we do have access to very good compressors. This would suggest the following approximation: compress the training data using a good compressor $\Rightarrow$ use the average number of bits used to encode a batch of data as an approximation to the optimal batch loss $\Rightarrow$ run IAM with this approximation. We can consider adding this as potential future work, if the reviewer thinks it is insightful or partially answers their question. We are also considering how to extend our convergence analysis to allow for an approximation error. This has proved challenging, and we cannot guarantee that our revision will include results on this.
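A hypothetical sketch of that approximation pipeline, using a generic off-the-shelf compressor (zlib) as a stand-in for the "good compressor" mentioned above; the function name, the constants, and the unit conversion are our own illustrative choices, not taken from the paper:

```python
import math
import zlib

# Hypothetical: estimate the optimal per-token cross-entropy of a text batch
# from the number of bits a generic compressor needs to encode it.
def estimate_optimal_batch_loss(batch_text: str, num_tokens: int) -> float:
    compressed_bits = 8 * len(zlib.compress(batch_text.encode("utf-8"), level=9))
    bits_per_token = compressed_bits / num_tokens
    return bits_per_token * math.log(2)   # convert bits/token to nats/token

# This estimate would then play the role of the optimal batch loss in a
# Polyak-type step, e.g. gamma = max(batch_loss - estimate, 0) / grad_norm_sq.
print(estimate_optimal_batch_loss("the quick brown fox jumps over the lazy dog", num_tokens=9))
```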

Review (Rating: 3)

This paper presents a new anytime convergence result for stochastic polyak step-size which requires access to the loss for every training batch evaluated at a solution. They also provide a momentum variant of this method and prove convergence rates with respect to the final iterate. Lastly, they provide a small set of experiments which compare an approximation of their momentum optimization algorithm to other standard optimizers for toy experiments and a distillation problem.

Questions for Authors

Questions I had while reading the manuscript

  • Can you explain this statement more formally: "By anytime, we mean a proof that the method converges to any predefined tolerance without prior knowledge of that tolerance."
  • "The proof of convergence for SPS* in the Lipschitz convex and strongly convex setting was first given in (...)" - the main contribution is then your convergence result (for the same algorithm) is any-time vs not anytime.
  • " ... but they either require bounded domain (Levy et al., 2018; Kavis et al., 2019) or are only studied in the deterministic setting (Li & Lan, 2023)." Can you clarify what the difference between this assumption and what you assume in line 270 is?
  • The teacher student setting does not really explain why that would be any different from a general stochastic setting. Why would we expect a better approximation to zeta there over other settings. Is it just that the surrogate (from the teacher) should be close to the minimum provided by the original problem? If so this seems very related to the online imitation learning setting.
  • You state that "assuming access to fξ(x∗) is not the same as assuming that interpolation holds", but again I am a bit unclear on the relationship between this assumption and the general stochastic / deterministic assumption.

Claims and Evidence

SPS* Convergence Rates

Claim:

  • SPS* achieves optimal/best-known convergence rates under a new set of assumptions not considered by existing optimization literature

Evidence:

  • Formal proofs in Theorem 2.1 and Corollaries 2.2, 2.3
  • Shows O(1/√T) rate for Lipschitz functions which matches known lower bounds
  • First O(1/√T) anytime convergence rate for smooth convex case

IAM Last-Iterate Convergence

Claim:

  • IAM achieves same favorable rates as SPS* but for last iterate

Evidence:

  • Theoretical proofs in Theorems 3.2 and 3.3
  • Avoids need for iterate averaging which is less practical
  • Matches optimal rates in non-smooth case

Adaptivity to Interpolation

Claim:

  • Methods automatically adapt to interpolation settings

Evidence:

  • Theoretical analysis showing rates improve when interpolation parameter σ* approaches 0
  • Proofs show O(1/T) convergence under interpolation vs O(1/√T) otherwise

Convergence Without Smoothness/Lipschitz Assumptions

Claim:

  • Methods converge on non-smooth, non-Lipschitz problems

Evidence:

  • Experiments on Poisson regression showing convergence comparable to L-BFGS
  • Performance matches SGD with best manually-tuned learning rate

Robustness to Misspecification

Claim:

  • Methods are somewhat robust to errors in optimal loss estimates

Evidence:

  • Experiments showing convergence (though slower) when using approximate/averaged values
  • Detailed ablation studies with different levels of noise

Effectiveness for Model Distillation

Claim:

  • Methods enable efficient distillation without hyperparameter tuning

Evidence:

  • Experiments distilling large GPT-2 models on three datasets
  • Competitive performance vs carefully tuned SGD/Adam baselines
  • Learning rate plots showing sensible adaptive behavior

Methods and Evaluation Criteria

Empirical

The methods and evaluation criteria are appropriate for the stated goals, with evaluation chosen to:

  • Demonstrate practical utility in distillation
  • Test robustness to key assumptions

Weaknesses

  • Could include more diverse set of optimization problems
  • Relatively small number of datasets
  • Limited comparison to other adaptive methods (especially Adagrad / SLS variants)
  • One SLS variant, SSO, seems designed for applications related to the distillation problem (online imitation learning) and may be a good variant to consider evaluating against.
  • No evaluation or discussion of computational overhead
  • Some confusion (for me) around why distillation satisfies their assumption (outside of non-convexity issues)

Theoretical Claims

I did not have time to work through the entire appendix; however, the results they present seem reasonable given the assumptions they make regarding access to the true function evaluation for every batch. I would like some additional discussion around exactly how this assumption relates to interpolation settings and deterministic settings. It seems like the assumption of knowing the true $f_\zeta$ reduces the problem to a deterministic one, which from my understanding becomes effectively equivalent to an interpolation setting.

Experimental Design and Analyses

Poisson Regression Experiments:

Summary

  • Comparison methodology against L-BFGS and SGD
  • Learning rate sweep for baselines (0.001 × {0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 20, 50})
  • Dataset characteristics

Sound aspects:

  • Appropriate baseline methods
  • Reasonable learning rate range
  • Clear reporting of epochs/convergence

Issues:

  • No mention of number of repeated runs/statistical significance

Misspecification Study:

Summary:

  • Experimental setup with synthetic quadratic problems
  • Three versions of IAM tested (theoretical, averaged, lower-bound)

Sound aspects:

  • Well-controlled synthetic setup
  • Clear parameter settings
  • Systematic variation of noise (ν) and batch size (b)

Issues:

  • No error bars/variance reporting
  • Single problem structure (quadratic) may limit generalizability

Distillation Experiments:

Summary:

  • Dataset splits and sizes
  • Model architectures
  • Hyperparameter tuning procedure
  • Baseline comparison methodology

Sound aspects:

  • Multiple datasets
  • Clear documentation of model sizes
  • Systematic hyperparameter search for baselines

Issues:

  • No reporting of variance across runs
  • No discussion of computational costs
  • Training durations not clearly specified

Limitations in Experimental Design:

  1. Lack of statistical significance testing
  2. Missing error bars/variance reporting
  3. Incomplete documentation of computational requirements
  4. Limited ablation studies on algorithmic components

While the experimental designs are generally reasonable, the lack of statistical analysis and variance reporting makes it difficult to fully assess the robustness of the results (I did not catch a mention of seeds or multiple runs).

Supplementary Material

I was only able to briefly browse the material in the appendix, and largely focused on the experimental results.

Relation to Prior Work

I cannot speak to the theoretical impact on the larger scientific community, but the experimental results seem at best incremental and are currently missing quite a few comparisons.

Essential References Not Discussed

From my understanding of this area, the authors seem to reference all of the groups that I am aware of working on these methods. There could, however, be some newer papers that I am not aware of.

Other Strengths and Weaknesses

Strengths:

Theoretical Depth & Rigor:

  • Comprehensive convergence analysis
  • Clear proofs and technical innovations
  • Handles multiple function classes
  • Adapts automatically to interpolation

Practical Relevance:

  • Demonstrates real application in model distillation
  • Shows how to incorporate momentum and Adam-style preconditioning
  • Seems to avoid hyperparameter tuning in practice

Organization & Clarity:

  • Clear connection between theory and practice
  • Well-structured presentation
  • Thorough appendices with detailed proofs

Technical Innovation:

  • Novel analysis techniques (e.g., explicit inversion of convex monotone function)
  • First O(1/√T) anytime rate for smooth case

Validation Strategy:

  • Tests both theoretical claims and practical utility
  • Uses multiple datasets and model sizes
  • Includes ablation on misspecified optimal losses

Weaknesses:

Experimental Rigor:

  • No reporting of random seeds
  • Missing error bars/statistical significance
  • Limited ablation studies
  • No analysis of computational overhead

Limited Comparisons:

  • Insufficient theoretical comparison to AdaGrad
  • Limited empirical comparison to other adaptive methods
  • Relatively small number of datasets

Practical Limitations:

  • Core method (SPS*) requires access to optimal batch losses
  • Theoretical results assume convexity
  • Primary applications limited to special cases (interpolation/distillation)

Missing Analysis:

  • No discussion of computational complexity
  • Limited analysis of robustness to hyper-parameters
  • No investigation of failure modes

Empirical Gaps:

  • Limited diversity in optimization problems and algorithms tested
  • No large-scale benchmarking

Other Comments or Suggestions

Most of my issues are on the experimental side, so if you could add additional baselines to your comparisons I would consider increasing my score, subject to the other reviewers substantiating the relevance of the theoretical claims.

Author Response

We thank the reviewer for their comprehensive review. We have divided our answers into Main Questions (MQ) and Questions (Q).

MQ1 anytime convergence formally?

We say $x_t$ has an anytime convergence rate if, once parameters are fixed, there exists a real function $r:[0,+\infty) \to [0,+\infty)$ such that $\mathbb{E}[f(x_t) - f(x_*)] \leq r(t)$ and $r(t) \to 0$ as $t \to +\infty$.

We say an algorithm has finite-horizon convergence if, for every tolerance $\epsilon>0$, there exists a choice of parameters and there exists $T \geq 1$ such that $\mathbb{E}[f(x_T) - f(x_*)] \leq \epsilon$.

If an algorithm has an anytime convergence rate then it has a finite-horizon convergence rate, but the converse is not true in general. The counterexample is SGD with a constant stepsize, which does not have an anytime convergence rate, but has finite-horizon convergence.

MQ2 main contribution is (SPS*) anytime rate?

Our main contributions are the anytime rates for SPS* in the smooth (convex and strongly convex) setting, proposing the IAM method, all of the last-iterate analysis of IAM, and identifying a reasonable application (black-box model distillation).

MQ3 Difference between bounded domain and line 270

The bounded-domain assumption means we know a bounded set $\mathcal{B}\subset \mathbb{R}^d$ such that $x_* \in \mathcal{B}$. If we know such a set, we can modify SGD to project $x_t$ onto $\mathcal{B}$.

Our assumption on line 270 and Eq (9) is very different and comes almost for free. We require that the gradients verify some inequality on a certain ball, which we do not need to know. As we emphasize in Remark 2.4, for finite-sum problems our assumption is always true, because convex functions always have bounded gradients on every bounded set.

MQ4 The teacher should be close to the minimum? Related to online imitation learning?

Yes, the teacher's loss is assumed to be close to the minimum student loss, and thus we can use the teacher loss to set the learning rate of the student using IAM. There is some relation to the online imitation learning setting; both involve a student learning from a teacher, but there are differences. In online imitation learning the student learns a policy and has to interact with an environment, whereas in our setting there is no environment or policy.

MQ5 Unclear "access to $f_{\xi}(x^*)$ not the same as interpolation"

This is a good question. In short, assuming access to $f_{\xi}(x^*)$ imposes no restriction on the problem itself. This could be any stochastic optimization problem. Rather, this is an assumption as to what we know regarding the problem, and what we have access to.

The interpolation assumption does impose a constraint on what type of problems we are considering. Interpolation restricts our problem to models that can perfectly fit every single data point. This can (approximately) occur in image classification tasks, where the neural network is sufficiently overparameterized with respect to the data set. This cannot occur when training a generative language model, because it is impossible to perfectly guess the next token in a sequence due to ambiguities in language. Because of this, the cross-entropy loss is never zero, and is always above the entropy of language (around 2.0). Thus the interpolation assumption cannot hold in generative language modelling. In contrast, assuming access to $f_{\xi}(x^*)$ imposes no restriction on the model itself, and it can occur in settings such as generative language modelling, for instance our black-box model distillation problem.

Q1 Comparisons to Adagrad

In the non-smooth setting, Adagrad has a guaranteed convergence rate of $\sqrt{2}GD/\sqrt{T}$, which is a factor $\sqrt{2}$ worse than our rates for SPS* and IAM in Eq (12) and (20). The difference between this rate for Adagrad and our rate for IAM is that our rate holds for the last iterate of IAM, while Adagrad assumes access to a bounded domain $D$ which contains the solution. We can include numerical comparisons to Adagrad, but AdamW is SOTA for LLMs, not Adagrad, which performs poorly for LLMs.

Q2 Comparisons SLS variant, SSO

Neither SLS nor the SSO variant is applicable in our black-box model distillation setting. SLS assumes that we can use the teacher to generate a candidate sequence, whereas SSO assumes we have access to the logits of the teacher. But our black-box setting assumes we only have access to the final loss of the teacher, not the logits.

Q3 Discussion on computational overhead

Certainly, SPS* and IAM impose a small additional computational overhead as compared to SGD and SGD-M. Both SPS* and IAM need only store one additional scalar, and do two inner products with respect to the gradient. We will add this remark to the appendix.

Q4 Random seeds

We use 42 to set all random seeds.

Q5 No failure modes

We investigate failure modes regarding misspecification of the optimal loss values in Section L2.

Final Decision

This paper analyzes the stochastic gradient method with Polyak stepsizes, termed SPS*. With access to the optimal loss value, the method adaptively selects stepsizes and is proven to achieve the optimal convergence rate for globally Lipschitz functions—without requiring knowledge of problem-specific constants such as the distance to the solution or the Lipschitz constant. The paper further extends SPS* by incorporating momentum, achieving nearly the same convergence rate but applicable to the last iterate instead of the average. Computational experiments demonstrate that the optimal loss value can be effectively approximated, enabling efficient implementation of SPS* with empirical performance.

However, one reviewer raised significant concerns regarding the assumption of local convexity. Specifically, the assumption that each component function $f_i$ is convex in a neighborhood of the global minimizer $x_*$ of $f = \frac{1}{n} \sum_{i=1}^n f_i$ is quite strong. The authors do not provide any practical examples to support this assumption, which undermines the applicability of the theoretical results in nonconvex settings. As such, the theoretical contributions may be seen as overly optimistic or unrealistic under practical scenarios. Given these concerns, the current version of the paper does not appear ready for publication. We encourage the authors to address the reviewers' feedback and revise the manuscript accordingly.