PaperHub
Score: 6.8 / 10
Decision: Rejected · 4 reviewers
Ratings: 3, 3, 5, 6 (min 3, max 6, std 1.3)
Confidence: 3.3
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.5
Venue: NeurIPS 2025

Online Convex Optimization with Heavy Tails: Old Algorithms, New Regrets, and Applications

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-11-14

Abstract

In Online Convex Optimization (OCO), when the stochastic gradient has a finite variance, many algorithms provably work and guarantee a sublinear regret. However, limited results are known if the gradient estimate has a heavy tail, i.e., the stochastic gradient only admits a finite $\mathsf{p}$-th central moment for some $\mathsf{p}\in\left(1,2\right]$. Motivated by it, this work examines different old algorithms for OCO (e.g., Online Gradient Descent) in the more challenging heavy-tailed setting. Under the standard bounded domain assumption, we establish new regrets for these classical methods without any algorithmic modification. Remarkably, these regret bounds are fully optimal in all parameters (can be achieved even without knowing $\mathsf{p}$), suggesting that OCO with heavy tails can be solved effectively without any extra operation (e.g., gradient clipping). Our new results have several applications. A particularly interesting one is the first provable convergence result for nonsmooth nonconvex optimization under heavy-tailed noise without gradient clipping.
Keywords
Online Learning · Online Convex Optimization · Heavy Tails

Reviews and Discussion

Review (Rating: 3)

This paper revisits classical algorithms for Online Convex Optimization (OCO) under heavy-tailed stochastic gradient noise (i.e., only a finite $\mathsf{p}$-th moment for $\mathsf{p} \in (1,2]$). The authors derive new regret bounds in expectation that are optimal in all parameters, and notably without requiring algorithmic modifications like gradient clipping. They further apply these results to both nonsmooth convex and nonsmooth nonconvex stochastic optimization.

Strengths and Weaknesses

Strengths:

  1. The paper is well written, and the derivations are careful and accessible, even for readers not familiar with heavy-tailed optimization.
  2. Many results are given.
  3. Demonstrating that classical algorithms (OGD, DA, AdaGrad) achieve optimal regret bounds under heavy tails, without modification, is practically relevant.

Weaknesses:

  1. While the paper makes a clean and polished contribution, the core technical novelty—refined regret analysis of old algorithms under a well-posed heavy-tailed model—is limited. The technique of guarantees without clipping has been shown in the related work [1], and the technique of a finite $\mathsf{p}$-th moment for $\mathsf{p} \in (1,2]$ in OCO has also been shown in the related work [2].

[1] Nonconvex Stochastic Optimization under Heavy-Tailed Noises: Optimal Convergence without Gradient Clipping, ICLR 2025.

[2] Parameter-free regret in high probability with heavy tails, NeurIPS 2022.

  2. The work does not propose new algorithms, nor does it improve performance through any principled modification—most results follow from careful bounding rather than conceptual advancement.

Questions

The author claims that the bounds this paper provided are optimal, and are optimal for all parameters. Can the author provide lower bounds to support this claim? Or provide specific results from relevant references to support this claim.

Because this paper focuses on theory, can the author provide a detailed explanation of the technical innovation of the proof technique beyond the two references [1, 2]?

[1] Nonconvex Stochastic Optimization under Heavy-Tailed Noises: Optimal Convergence without Gradient Clipping, ICLR 2025.

[2] Parameter-free regret in high probability with heavy tails, NeurIPS 2022.

Limitations

Yes

Final Justification

I have carefully read the rebuttal with other reviewers and re-examined the paper.

(1) The core contribution—obtaining guarantees under heavy tails without gradient clipping—has already been established in prior work [1,2]. While the paper shifts from stochastic optimization [1] to online learning, the analytical machinery and proof strategy closely mirror [1,2] with no new techniques or inequalities, so novelty is limited.

(2) The claim of being optimal in all parameters is not supported by matching multi-parameter lower bounds.

Overall, a major revision is needed, adding lower bounds and stronger result types to increase the paper's innovation, and adopting a more measured tone about contributions.

[1] Nonconvex Stochastic Optimization under Heavy-Tailed Noises: Optimal Convergence without Gradient Clipping, ICLR 2025.

[2] Parameter-free regret in high probability with heavy tails, NeurIPS 2022.

Formatting Issues

NA

Author Response

We thank the reviewer for the valuable comments. Below we address the reviewer's concerns.

W1&Second part of Questions. We provide a detailed comparison with [1] and [2] here.

  • Comparison with [1]:
    • The major difference is that the studied problems are orthogonal. In detail, [1] considers smooth nonconvex optimization, i.e., optimizing a single objective and aiming at minimizing the gradient norm. However, we work on an online learning problem: Online Convex Optimization (OCO), where we handle a series of time-varying convex functions and want to minimize the regret. Therefore, for both objectives and metrics, [1] and our work are entirely different. Hence, whatever results [1] has obtained, they cannot be applied to our setting.
    • Next, from a technical view, [1] considers both gradient normalization and momentum, which is far from our work. Concretely, [1] has shown these tricks work for heavy-tailed smooth nonconvex optimization, but this doesn't mean they are effective for heavy-tailed OCO, since, again, the studied problems are distinct and their analysis is tailored for nonconvex problems. Moreover, the results in [1] cannot reflect whether OGD/DA/AdaGrad works for heavy-tailed OCO, as none of these algorithms employ normalization or momentum. So the technical ideas described in our Section 3 are independent of [1] and hence novel.
  • Comparison with [2]:
    • Although [2] also considers OCO with heavy tails, their motivation largely differs from ours as discussed in Lines 83-88:

      They aim to solve heavy-tailed OCO with a new proposed method that contains many nontrivial technical tricks, including gradient clipping, artificially added regularization, and solving the additional fixed-point equation. However, their result cannot reflect why the existing simple OCO algorithms like OGD work in practice under heavy-tailed noise. In contrast, our goal is to examine whether, when, and how the classical OCO algorithms work under heavy tails, thereby filling the missing piece in the literature.

    • Next, from a technical view, OGD/DA/AdaGrad contains none of the tricks employed in the algorithm of [2] (i.e., gradient clipping, regularization, and solving the fixed-point equation as mentioned above). Therefore, any proof technique in [2] used to handle the heavy-tailed noise cannot be applied to show that OGD/DA/AdaGrad has a finite regret, not to mention the optimality of these algorithms. As such, we need to analyze from scratch and find a new approach. Due to this, the ideas presented in Section 3 should be viewed as the main technical innovation of our work, as they cannot be found in [2] or anywhere in the existing OCO literature.
    • Moreover, the novelty of our techniques can also be reflected in the results. First, the regret bound $GD\sqrt{T}+\sigma DT^{1/\mathsf{p}}$ of all three algorithms proved in our work is optimal in all parameters (see also our response to First part of Questions below). In contrast, the regret bound $(G+\sigma)DT^{1/\mathsf{p}}$ shown by [2] is not. This indicates that our analysis is tight, which is again due to our novel technique. Second, our result for AdaGrad is particularly interesting since it is adaptive to all problem-dependent parameters (i.e., $G$, $\sigma$, and $\mathsf{p}$). However, [2] requires all of them in their algorithm.
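As an elementary check (using nothing beyond the quantities above, and not a claim from either paper): since $\mathsf{p}\leq2$ implies $\sqrt{T}\leq T^{1/\mathsf{p}}$,

$$GD\sqrt{T}+\sigma DT^{1/\mathsf{p}}\;\leq\;(G+\sigma)DT^{1/\mathsf{p}},$$

so the bound above is never larger than the one in [2], and its $G$-term is smaller by a factor of $T^{1/\mathsf{p}-1/2}$ whenever $\mathsf{p}<2$.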

W2. We believe our study has its own worth even without newly proposed algorithms or any principled modification, due to the following reasons.

  • First, from a purely theoretical perspective, we provide a new way to analyze classical OCO algorithms under heavy tails. Importantly, it helps us establish the first entirely optimal regret bound for heavy-tailed OCO, which we believe is an important contribution. Moreover, what we view as particularly valuable in the new analysis is its simplicity, while remaining insightful, which makes the proof clear and accessible, as pointed out by the reviewer. We believe this balance of depth and reader-friendliness is particularly precious in today's theoretical work.
  • Next, the applications in Section 4, based on our new optimal regret bounds established in Section 3, also provide interesting theoretical results, for example, the first optimal convergence rate of SGD for nonsmooth convex optimization under heavy tails, which even holds for the last iterate. As commented by Reviewer pPe8, this may shed light on some phenomena during the training of LLMs. Furthermore, we also show how to optimize nonsmooth nonconvex objectives and give the first provable result without gradient clipping in Corollary 3, which not only improves over the best existing sample complexity of [4] but is also the first to explain their experimental observation, i.e., the good empirical performance of OGD + O2NC even without gradient clipping. Additionally, we also tackle the question explicitly asked by [4]: how to establish a provable result for the case where problem-dependent parameters are unknown in advance. For more details, we kindly refer the reviewer to Lines 310-312.
  • Moreover, one important implication on the practical side is to reverse the popular view that gradient clipping is necessary for SGD when facing heavy tails, and to (partially) answer why many existing algorithms already work very well in practice, which we believe can help the community understand many popular optimizers more deeply. Hence, closing the gap between practice and theory, which is the aim of this study, is a valuable task.
  • Lastly, we also want to mention that providing a better theoretical understanding of existing algorithms is important and popular among the optimization and even the broader ML community. To name a few:
    • SGD was invented last century [5]. Understanding its convergence theory remains a central topic in the optimization community. For example, its rate for nonconvex problems was not known until the seminal work [6], approximately 60 years after the birth of SGD. After another 10 years, [7] further explored the high-probability convergence of SGD and pushed forward the frontier of its theoretical study. Therefore, the effort of establishing new convergence theory for existing algorithms never ends.
    • Another popular algorithm, Adam, perhaps the most famous variant of the AdaGrad algorithm studied in this work, was introduced by [8] in 2014. Even in recent years, many works are still trying to understand its convergence properties for nonconvex optimization under various settings; for example, see [9, 10].
    • Going beyond optimization, take the classical mean estimation problem as an example, which plays an important role in different areas, e.g., ML/TCS/Statistics. The median-of-means estimator was introduced by [11] last century, but its theoretical study has remained popular over the past decades, as summarized in [12]. Even in 2025, new studies of it still appear [13].

Based on the above points, we believe our work makes a solid contribution to the community.

First part of Questions. As described in our work, the optimal lower bound should be in the order of $GD\sqrt{T}+\sigma DT^{1/\mathsf{p}}$.

  • The first part $GD\sqrt{T}$ is the standard lower bound for deterministic OCO. For its proof, see, e.g., Section 5.1 of [14].
  • The second part is based on the reduction to stochastic convex optimization (SCO). Simply speaking, any lower bound for heavy-tailed SCO also applies to OCO. To see this fact, one can take $\ell_t=F,\ \forall t\in[T]$. Thus, standard lower bounds from heavy-tailed SCO can be translated to $\sigma DT^{1/\mathsf{p}}$. For its proof, see, e.g., Section 4 of [15], Theorem 3 of [16], or Section 5.3.1 of [11].
  • Combining the above two results, we have the lower bound $GD\sqrt{T}+\sigma DT^{1/\mathsf{p}}$. A discussion essentially the same as above can be found in Lines 89-95.
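To spell out the reduction in the second bullet, write $\mathrm{Regret}_T:=\sum_{t=1}^T\ell_t(\boldsymbol{x}_t)-\min_{\boldsymbol{x}\in\mathcal{X}}\sum_{t=1}^T\ell_t(\boldsymbol{x})$ and $\bar{\boldsymbol{x}}_T:=\frac{1}{T}\sum_{t=1}^T\boldsymbol{x}_t$ (notation introduced only for this sketch). Taking $\ell_t=F$ for all $t\in[T]$ and applying Jensen's inequality,

$$\mathbb{E}\left[F(\bar{\boldsymbol{x}}_T)-\min_{\boldsymbol{x}\in\mathcal{X}}F(\boldsymbol{x})\right]\;\leq\;\frac{1}{T}\,\mathbb{E}\left[\sum_{t=1}^T\ell_t(\boldsymbol{x}_t)-\min_{\boldsymbol{x}\in\mathcal{X}}\sum_{t=1}^T\ell_t(\boldsymbol{x})\right]\;=\;\frac{\mathbb{E}\left[\mathrm{Regret}_T\right]}{T},$$

so any $\Omega(\sigma DT^{-(1-1/\mathsf{p})})$ lower bound on the achievable suboptimality in heavy-tailed SCO forces $\mathbb{E}\left[\mathrm{Regret}_T\right]=\Omega(\sigma DT^{1/\mathsf{p}})$.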

References.

[1] Liu, Z., & Zhou, Z. Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping. ICLR 2025.

[2] Zhang, J., & Cutkosky, A. Parameter-free regret in high probability with heavy tails. NeurIPS 2022.

[3] Cutkosky, A., et al. Optimal stochastic non-smooth non-convex optimization through online-to-non-convex conversion. ICML 2023.

[4] Liu, L., et al. High-probability bound for non-smooth non-convex stochastic optimization with heavy tails. ICML 2024.

[5] Robbins, H., & Monro, S. A stochastic approximation method. The annals of mathematical statistics 1951.

[6] Ghadimi, S., & Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIOPT 2013.

[7] Liu, Z., et al. High probability convergence of stochastic gradient methods. ICML 2023.

[8] Kingma, D. P., & Ba, J. Adam: A method for stochastic optimization. ICLR 2015.

[9] Li, H., et al. Convergence of adam under relaxed assumptions. NeurIPS 2023.

[10] Wang, B., et al. Closing the gap between the upper bound and lower bound of Adam's iteration complexity. NeurIPS 2023.

[11] Nemirovskij, A. S., & Yudin, D. B. Problem complexity and method efficiency in optimization. 1983.

[12] Lugosi, G., & Mendelson, S. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics 2019.

[13] Møller Høgsgaard, M., & Paudice, A. Uniform Mean Estimation for Heavy-Tailed Distributions via Median-of-Means. ICML 2025.

[14] Orabona, F. A modern introduction to online learning. arXiv:1912.13213.

[15] Vural, Nuri Mert, et al. Mirror descent strikes again: Optimal stochastic convex optimization under infinite noise variance. COLT 2022.

[16] Raginsky, M., & Rakhlin, A. Information complexity of black-box convex optimization: A new look via feedback information theory. Allerton 2009.

Comment

Thanks to the authors for the detailed response. I have no further concerns and intend to raise my score after the discussion with other reviewers and AC.

Comment

We are pleased to see that the reviewer's concerns have been addressed and thank the reviewer for the positive comments.

Review (Rating: 3)

This paper studies online convex optimization (OCO) where the observed gradients of the cost functions have heavy-tailed noise. First, this paper shows regret bounds for well-established algorithms such as online gradient descent and AdaGrad without any modification to these algorithms. Next, this work shows how the regret bounds can be applied to problems such as non-convex optimization with heavy-tailed noise.

Strengths and Weaknesses

The overall writing of this paper is solid, especially for the first half. The authors did a good job at carefully stating their claims and the associated assumptions. I carefully read the proofs in Appendix A and C -- they are easy to follow and contain no errors.

The heavy-tailed noise setting is generally under-explored, so I am glad to see that the authors were able to offer clear and concise analysis in this setting. However, I feel that the technical contributions are insufficient, as most of the techniques used in this paper are quite standard and do not deviate from the standard approaches too much.

In particular, I find that the analysis between lines 136-153 and 192-204 to be uninteresting because they are mostly applying standard techniques. I feel that the space could be re-directed towards Section 4, where the discussion there was quite rushed. I think this Section could benefit from extended discussion on the problem setup from [5], especially Definition 1 and the O2NC algorithm. A literature survey is also needed to bring more context for the readers. And if you want to include a proof sketch in the main body, it should go towards Theorem 4 and/or 5. Currently, Section 4 is rushed and its quality of writing is noticeably lower than the earlier sections. Because I feel that the results in Section 3 (Thm 1-3) are lacking technical sophistication, devoting more space to Section 4 may help improve the significance of this paper. Right now, I recommend a borderline reject, where I may change my assessment if the author could make a more compelling argument for the technical contributions in Section 4.

For the author's rebuttal, I am mainly interested in seeing the following:

  1. A more extensive literature review. I don't believe there are only 2-3 pieces of related work.
  2. Removal of the "standard analysis" in Section 3, or at least move them to the appendix.
  3. A detailed breakdown of Definition 1 and the notion of $K$-shifting regret.
  4. Outlines of the technical contributions of this paper over [5], specifically the analysis around Algorithm 4.
  5. More discussions around Theorems 4 and 5.

Lastly, some technical issues:

  1. On line 140, isn't the inequality due to the non-expansiveness property of convex projection?
  2. On line 578, what does the first sentence mean? Seems to be a typo.
  3. On line 582, the inequality is due to Jensen's inequality, not Hölder's.
  4. On line 231, I don't see why this rate would be considered sub-optimal? Your rate also contains a $T^{-(1-1/p)}$ term.
  5. For Algorithm 4, I think the output of the subroutine $\textsf{A}$ would be the displacement, not the location of the next step? If so, the notation should be clarified.

Questions

see above

Limitations

see above

Final Justification

While this paper showed some promise in resolving a novel technical question, its contribution is diminished by the disorganized presentation of the results. Although I provided a specific course of action to address these issues, the authors were unwilling to consider most of my suggestions.

As a result, the concerns raised in my original review still stand and I will maintain my scores.

Formatting Issues

None

Author Response

We thank the reviewer for the detailed feedback. Below we address the reviewer's concerns.

Technical Contribution.

  • We believe our technical ideas are novel and insightful (as also commented by Reviewers pPe8 and crhG) since they have not been proposed in the OCO literature before and could be extended to broader situations, e.g., Online Mirror Descent (see also our response to W2&Q1 for Reviewer crhG). Based on our ideas, we provide the first finite regret bounds for OGD/DA/AdaGrad and, remarkably, they are entirely optimal in all parameters, which has never been achieved in heavy-tailed OCO. Therefore, these new ideas address an almost empty part of the literature. Due to the above, we believe they contain sufficient technical contributions to the community, regardless of whether the ideas themselves are simple or not (indeed, we believe the simplicity is another advantage of our work, see below).

  • The contribution of our techniques is also evident in the results in Section 4. Our new bounds for both nonsmooth convex and nonconvex optimization are better than those of existing works. These improvements are again due to our novel techniques.

  • Of course, our ideas are simple, but we view this simplicity as particularly valuable in today's theoretical research. One of the ultimate goals of a theoretical study, we believe, should be to help people understand and explain the algorithms, which is also the motivation for our work as asked in Line 53. Technical sophistication could be a part of theoretical research, but it is only a means, not the destination. Especially in cases where new but simple ideas are sufficient and effective, as in our work, technical complexity is unnecessary and should be avoided, in our opinion.

Organization (Related to W2).

  • From a reader's perspective, especially one not familiar with theory, we believe the analysis in Section 3 is still necessary. Though it doesn't deviate from the standard way too much, it contains new technical ideas that we want to convey, and its structure is carefully designed for those knowing little about this area. It may seem redundant for experts like reviewers, but considering the broad audience of NeurIPS, we think it is still valuable.

  • For Section 4, as its title suggests, it is all about applications based on the new regrets shown in Section 3. Without a doubt, it contains new and interesting theoretical results, but we emphasize that heavy-tailed OCO is the primary focus of this work, with the applications coming second. Moreover, because Section 4 requires some external knowledge from readers (e.g., Lemma 1 or the basics of Section 4.1), it can be recognized as an advanced topic. Hence, we give it the necessary attention but less weight than Section 3 to prevent overburdening readers. Lastly, we kindly remind the reviewer that Appendix E includes more background, discussions, and details of Section 4.1.

  • In summary, the current organization is to balance reader-friendliness and depth. But as suggested by the reviewer, we would try to condense Section 3 and, if more room is available, expand Section 4.

For other weaknesses:

W1. We are not sure what the reviewer meant by this point. For both Sections 4.1 and 4.2, we have included different related works in sentences, e.g., Lines 219, 220, 228, 229, 256, 264, 265, 267, 273. If the reviewer refers to the related works on heavy-tailed OCO, then [1] is the only work we know so far. If the reviewer thinks we missed any particular related work, we are happy to discuss and add it in the revision.

W3. For both Definition 1 and the concept of $K$-shifting regret, we borrow them from [2] without modification and point the reader to the corresponding place in [2]. We currently do this because:

  • Definition 1 itself was also evolved from works even earlier than [2], as stated in Section 2.1 of [2]. Providing every detail is hence not tractable and could distract the reader from the central topic of our work.

  • The $K$-shifting regret appears naturally in the proof of Theorem 7 of [2] (or Lemma 2 in our paper), which we invoke as a black box. Thus, the $K$-shifting regret is also presented in a black-box way.

As per the reviewer's suggestion, we plan to add some information on these two notions in the revision to improve the readability.

W4&W5. The main innovation of our work over [2] is showing: 1. Theorem 4: O2NC is effective under heavy-tailed noise without any modification; 2. Theorem 5: Three concrete options for $\mathsf{A}$ in O2NC are given with provable rates, based on the new regrets shown in Section 3.

  • For the first point, the proof in [2] only works for $\mathsf{p}=2$. Specifically, a key step to analyze O2NC is bounding $\mathbb{E}\left[\left\Vert\sum_{n=1}^T\boldsymbol{\epsilon}_n\right\Vert\right]$. If $\mathsf{p}=2$, [2] applies Hölder's inequality to obtain $\mathbb{E}\left[\left\Vert\sum_{n=1}^T\boldsymbol{\epsilon}_n\right\Vert\right]\leq\sqrt{\mathbb{E}\left[\left\Vert\sum_{n=1}^T\boldsymbol{\epsilon}_n\right\Vert^2\right]}\leq\sigma\sqrt{T}$, where the last step is by $\mathbb{E}\left[\left\langle\boldsymbol{\epsilon}_n,\boldsymbol{\epsilon}_m\right\rangle\right]=0,\forall n\neq m$ and $\mathbb{E}\left[\left\Vert\boldsymbol{\epsilon}_n\right\Vert^2\right]\leq\sigma^2$. However, this derivation fails immediately whenever $\mathsf{p}<2$, as $\mathbb{E}\left[\left\Vert\boldsymbol{\epsilon}_n\right\Vert^2\right]$ can now be $+\infty$. Thus, we have to bring up an alternative way to bound $\mathbb{E}\left[\left\Vert\sum_{n=1}^T\boldsymbol{\epsilon}_n\right\Vert\right]$. Fortunately, an existing inequality can be borrowed to deal with this case (see Lemma 3 in the paper and the sketch after this list). As such, the proof of Theorem 4 certainly differs from [2]. Due to the limited space, we cannot expand everything here and kindly refer the reviewer to Appendix E.2 for more details.

  • The second point is also important because, to apply Theorem 4, we still need to find an OCO algorithm having a finite $K$-shifting regret in expectation under heavy tails. However, whether such an algorithm exists and, if so, what it is remains unclear, as, again, everything in [2] is only for $\mathsf{p}=2$ and hence cannot be directly applied in our setting. Luckily, due to our new regret bounds shown in Section 3, we can confidently set $\mathsf{A}$ to be OGD/DA/AdaGrad (though the $K$-shifting regret slightly differs from the standard regret, the results in Section 3 are enough, see Lines 668-672 for details). In other words, the proof of Theorem 5 builds on top of Theorem 4 and Section 3, neither of which can be obtained from [2].

  • In summary, both Theorems 4 and 5 cannot be concluded from [2], and we prove them in a new way. From the results side, they generalize the case $\mathsf{p}=2$ in [2] to $1<\mathsf{p}\leq2$.
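As referenced in the first bullet above, the kind of bound involved can be illustrated generically as follows (the precise statement we invoke is Lemma 3 in the paper; the von Bahr-Esseen-type bound below is shown only as one standard example of how a $T^{1/\mathsf{p}}$ scaling can arise, for a martingale difference sequence with $\mathbb{E}\left[\left\Vert\boldsymbol{\epsilon}_n\right\Vert^{\mathsf{p}}\right]\leq\sigma^{\mathsf{p}}$ and $\mathsf{p}\in(1,2]$):

$$\mathbb{E}\left[\left\Vert\sum_{n=1}^T\boldsymbol{\epsilon}_n\right\Vert\right]\;\leq\;\left(\mathbb{E}\left[\left\Vert\sum_{n=1}^T\boldsymbol{\epsilon}_n\right\Vert^{\mathsf{p}}\right]\right)^{1/\mathsf{p}}\;\lesssim\;\left(\sum_{n=1}^T\mathbb{E}\left[\left\Vert\boldsymbol{\epsilon}_n\right\Vert^{\mathsf{p}}\right]\right)^{1/\mathsf{p}}\;\leq\;\sigma T^{1/\mathsf{p}},$$

where the first step is Jensen's inequality, the second hides an absolute constant, and the final quantity replaces the $\sigma\sqrt{T}$ term that is available only when $\mathsf{p}=2$.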

Remark. Due to character limitations for rebuttal, we also invite the reviewer to go through Appendix E, which includes more information about Section 4.1, particularly Theorems 4 and 5.

For other issues mentioned by the reviewer, we believe none of them have any technical flaws and would like to clarify them below.

Issue 1. It holds by $\boldsymbol{x}_{t+1}=\Pi_{\mathcal{X}}(\boldsymbol{x}_t-\eta_t\boldsymbol{g}_t)=\mathrm{argmin}_{\boldsymbol{x}\in\mathcal{X}}\frac{1}{2}\left\Vert\boldsymbol{x}-\boldsymbol{x}_t+\eta_t\boldsymbol{g}_t\right\Vert^2\overset{(a)}{\Rightarrow}\left\langle\boldsymbol{x}_{t+1}-\boldsymbol{x}_t+\eta_t\boldsymbol{g}_t,\boldsymbol{x}_{t+1}-\boldsymbol{x}\right\rangle\leq0,\forall\boldsymbol{x}\in\mathcal{X}$, where $(a)$ is by the optimality condition (i.e., $\boldsymbol{y}=\mathrm{argmin}_{\boldsymbol{x}\in\mathcal{X}}h(\boldsymbol{x})\Rightarrow\left\langle\nabla h(\boldsymbol{y}),\boldsymbol{y}-\boldsymbol{x}\right\rangle\leq0,\forall\boldsymbol{x}\in\mathcal{X}$ for any convex $\mathcal{X}$). We then obtain the corresponding inequality after rearranging terms. We didn't see a way to conclude it directly from the non-expansive property of convex projection, but we are happy to discuss it with the reviewer.
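For completeness, one way to carry out the rearrangement is via the polarization identity $2\left\langle\boldsymbol{a},\boldsymbol{b}\right\rangle=\left\Vert\boldsymbol{a}+\boldsymbol{b}\right\Vert^2-\left\Vert\boldsymbol{a}\right\Vert^2-\left\Vert\boldsymbol{b}\right\Vert^2$ with $\boldsymbol{a}=\boldsymbol{x}_t-\boldsymbol{x}_{t+1}$ and $\boldsymbol{b}=\boldsymbol{x}_{t+1}-\boldsymbol{x}$ (a generic sketch; the inequality on Line 140 of the paper may be written in a slightly different form):

$$\left\langle\boldsymbol{g}_t,\boldsymbol{x}_{t+1}-\boldsymbol{x}\right\rangle\;\leq\;\frac{1}{\eta_t}\left\langle\boldsymbol{x}_t-\boldsymbol{x}_{t+1},\boldsymbol{x}_{t+1}-\boldsymbol{x}\right\rangle\;=\;\frac{\left\Vert\boldsymbol{x}_t-\boldsymbol{x}\right\Vert^2-\left\Vert\boldsymbol{x}_{t+1}-\boldsymbol{x}\right\Vert^2-\left\Vert\boldsymbol{x}_t-\boldsymbol{x}_{t+1}\right\Vert^2}{2\eta_t}.$$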

Issue 2. The description “$1/\eta_0$ should be read as $0$” is to guarantee that the inequality above Line 578 is well-defined. Concretely, we need to make sure that when $t=1$, the notation $1/\eta_0$ is well-defined in every step above Line 578. Therefore, this is not a typo.

Issue 3. We remark that the corresponding inequality can also be derived from Hölder's inequality: for any random variable $X>0$, we have $\mathbb{E}\left[X^{1/\mathsf{p}}\right]\overset{(a)}{\leq}\left(\mathbb{E}\left[X\right]\right)^{1/\mathsf{p}}\left(\mathbb{E}\left[1^{\mathsf{p}_\star}\right]\right)^{1/\mathsf{p}_\star}=\left(\mathbb{E}\left[X\right]\right)^{1/\mathsf{p}}$, where $(a)$ is due to Hölder's inequality and $\mathsf{p}_\star=\mathsf{p}/(\mathsf{p}-1)$ is the conjugate of $\mathsf{p}$. Letting $X=\sum_{t=1}^{T}\left\Vert\boldsymbol{\epsilon}_t\right\Vert^{\mathsf{p}}$ recovers the inequality under Line 582.

Issue 4. We call it suboptimal since it cannot automatically recover the optimal rate $GD/\sqrt{T}$ in the deterministic case (i.e., $\sigma=0$). In other words, the optimal rate should be in the form of $GD/\sqrt{T}+\sigma D/T^{1-1/\mathsf{p}}$, as achieved by our work (the $\sigma^{\mathsf{p}}$ in both Corollaries 1 and 2 is a typo; we have fixed it to $\sigma$). A similar discussion can be found in Lines 220-222.

Issue 5. The output of $\mathsf{A}$ can be viewed as a displacement, and we have never referred to it as the location of the next step. We use $\boldsymbol{x}_t$ for consistency with the previous sections, and in particular with the notation system specified in Section 2 (especially Remark 2).

References.

[1] Zhang, J., & Cutkosky, A. Parameter-free regret in high probability with heavy tails. NeurIPS 2022.

[2] Cutkosky, A., et al. Optimal stochastic non-smooth non-convex optimization through online-to-non-convex conversion. ICML 2023.

Comment

I thank the author for their very detailed response. I am overall satisfied with the technical correctness of this work, but I still have major concerns about the presentation.

While the author argued that they want to keep the technical discussions in Section 3 approachable, the current writing of Section 4 is very dense and would certainly scare away those readers who are not very familiar with the literature. While the author's responses to W4 and W5 are a step in the right direction, they are not quite sufficient to bridge the gap caused by the sudden change in style.

Also, I am not convinced that having the results in Section 3 as the sole primary contribution would be enough. While the results in this section are new to the literature, they are following too closely to the standard techniques. In my opinion, Section 4 brings the technical sophistication that significantly strengthens the overall paper. But I am not happy to hear that the author seems to be dismissive of my perspective.

Overall, I feel that the author is being unnecessarily defensive about their current course. Unless I can see a more concrete action plan that addresses my concern regarding the organization, I will keep my score as-is.

Comment

We thank the reviewer for the further comments. There is one point we would like to clarify. We believe the goal of the review/rebuttal process is to clear up misunderstandings and exchange different opinions to improve the paper. From this perspective, we think it is also important to elaborate on the author's mindset behind the paper's organization. Accordingly, we provide a detailed explanation of the current structure of our paper, particularly about Sections 3 and 4.

We also, of course, take the reviewer's comments seriously, as mentioned in the rebuttal: we would try to condense Section 3 and, if more room is available, expand Section 4. Therefore, we don't think our rebuttal is dismissive of the reviewer's review or an unnecessary defense, and we never had any such intention.

We hope this addresses the reviewer's concern, and we are always happy to discuss with the reviewer if there is anything not clear.

Comment

I believe that one desired outcome of the rebuttal discussion is to improve the paper. Therefore, my original review listed 5 areas of improvement. I wasn't expecting the author to address all of the points, but I was planning to raise my score if the author could show concrete progress on at least 3 of them.

Unfortunately, the author's rebuttal only properly addressed my 4th point while showing unwillingness to make meaningful changes to the others. In particular, the author's claim that they want to make Section 3 more accessible directly contradicts the fact that their Section 4 is very dense and skips much important background information. Since the author ignored my repeated plea to resolve these issues, I have to maintain my current score.

Comment

We thank the reviewer for the further comments. However, we don't understand why the reviewer asserts that we ignored the reviewer's comment and showed unwillingness. We provide a detailed explanation below.

For W1, as in our rebuttal:

We are not sure what the reviewer meant by this point. For both Sections 4.1 and 4.2, we have included different related works in sentences, e.g., Lines 219, 220, 228, 229, 256, 264, 265, 267, 273. If the reviewer refers to the related works on heavy-tailed OCO, then [1] is the only work we know so far. If the reviewer thinks we missed any particular related work, we are happy to discuss and add it in the revision.

We are not even sure what the reviewer means. Could the reviewer clarify this point? Moreover, we also stated that we are happy to add any related references if the reviewer thinks we have missed.

For W2, as in our rebuttal:

In summary, the current organization is to balance reader-friendliness and depth. But as suggested by the reviewer, we would try to condense Section 3 and, if more room is available, expand Section 4.

As explained previously, we first summarized the mindset behind the current structure, but second, we stated that we will try to condense Section 3 and expand Section 4.

For W3, as in our rebuttal:

As per the reviewer's suggestion, we plan to add some information on these two notions in the revision to improve the readability.

Hence, we also directly followed the point mentioned by the reviewer.

For W5, we have answered it with W4 together. If the reviewer thinks any point in that discussion about Theorems 4 and 5 is inadequate/unclear, please let us know, and we are happy to answer in more detail.

We hope the above explanation addresses the reviewer's concern, and we are always happy to discuss with the reviewer if there is anything unclear.

Review (Rating: 5)

The aim of the paper is to provide regret guarantees for well-known algorithms, such as online gradient descent and AdaGrad, in the presence of heavy-tailed noise. The key reason these guarantees are possible lies in the paper’s setup and assumptions:

  1. The constraint set is compact.

  2. The results are established in expectation.

Under these assumptions, the paper proves regret bounds for plain online gradient descent and AdaGrad without requiring gradient clipping or robust gradient estimators.

Strengths and Weaknesses

Strengths:

  1. The paper is extremely well written and a pleasure to read.

  2. The technical aspects of the paper are easy to follow yet insightful.

  3. The paper provides a fair comparison with the existing literature.

Weakness:

  1. Perhaps the main weakness is that the authors do not discuss whether Online Gradient Descent and AdaGrad can perform reasonably in practice when the constraint set is unbounded.

  2. The authors do not address mirror descent-type algorithms.

[1] Jiujia Zhang and Ashok Cutkosky. Parameter-free regret in high probability with heavy tails.

Questions

  1. Why did the authors consider Dual Averaging but not Regularized Dual Averaging? Is there an additional challenge introduced by the presence of the regularizer?

  2. Could a similar line of proof be applied to algorithms that adapt to the strong convexity parameter, such as MetaGrad?

  3. Could the authors comment on whether similar approaches could be extended to second-order methods?

Limitations

I address the limitations that I have in mind in the previous sections.

Final Justification

I believe the paper’s contribution is both neat and solid. I have read the authors’ rebuttal and the concerns raised by other reviewers, and considering all of these, I am maintaining my score.

Formatting Issues

I have no formatting concerns.

Author Response

We thank the reviewer for the appreciation of our work and the proof technique. Below we address the reviewer's concerns.

W1. Thanks for bringing up this point. It is worth noting that, in reality, people simply run SGD (i.e., OGD for stochastic optimization) and AdaGrad (or its successor, say, Adam) on $\mathbb{R}^d$ without any projection step for many tasks that have heavy tails, e.g., training LLMs. It is widely known that these algorithms perform well in practice. Hence, as discussed in Lines 47 to 52, their surprisingly effective performance highly motivates this work. In addition, we believe that the popularity of these algorithms is sufficient to reflect their good practical performance when the constraint is unbounded. Thus, we currently do not discuss this much, but we are happy to add more if there is any point that the reviewer thinks particularly valuable. Of course, our current theory is inadequate to explain their success in this unbounded setting, which we would like to leave as future work, as mentioned in Section 5.

W2&Q1. Indeed, our work can generalize to Mirror Descent/Regularized Dual Averaging with almost no additional effort. More concretely, taking the proof for OGD as an example, the proof in Lines 136-153 does not involve any specific property of the Euclidean norm. Therefore, based on almost the same idea, one familiar with the proof of Online Mirror Descent (OMD) (even in the presence of a regularizer) can extend the analysis to OMD directly. In fact, we have also briefly mentioned this fact in Remark 1 (Lines 104-105), and are happy to elaborate more on this point in the revision when more space is permitted. Similarly, our core idea also works for Regularized Dual Averaging.

However, considering NeurIPS has a broad audience including those who may not be experts in optimization theory, we finally decided to keep it simple in the current form, as it already contains many results for the reader to digest. Moreover, the currently presented results can directly relate to SGD (i.e., OGD for stochastic optimization), which we believe could help the reader immediately catch one of the key messages conveyed by our work, that is, gradient clipping may be unnecessary for heavy-tailed noise.

In summary, showing more general results based on our new proof idea is doable; we finally chose the current form to balance the depth and reader-friendliness of the work.

Q2. This is an insightful question. Adapting to the strong convexity parameter is an interesting topic that is worth studying. However, we currently don't think it could be easily achieved based on existing methods. This is because, as far as we know, all the existing algorithms that can achieve this desired property assume that $\boldsymbol{g}_t$ is bounded by a constant $G$ almost surely (or a time-varying constant $G_t$), including the MetaGrad algorithm mentioned by the reviewer. However, in our work, we only assume that $\mathbb{E}\left[\boldsymbol{g}_t\right]$ is bounded. These two conditions are essentially different, i.e., a random variable having a finite mean value (regardless of whether it is heavy-tailed or not) is not necessarily bounded almost surely. Due to this, adapting to the strong convexity parameter based on existing algorithms would fail.

Moreover, even considering $\mathsf{p}=2$, the classical finite-variance condition, we do not know of any OCO algorithm that can adapt to the strong convexity parameter. Therefore, achieving this may not be an easy task, which we hope can be addressed in the future.

Q3. Thanks for the interesting question. Honestly, we are not familiar with second-order methods on either the theoretical or practical side. Therefore, to not mislead the reviewer, we can only answer that we don't know. But one point we would like to say is that the core inequalities (3), (4), and (5) hold independent of any OCO/stochastic optimization algorithm. Therefore, extending to second-order methods may be possible.

Comment

I wish to thank the authors for their responses. I have read all the comments from the other reviewers as well as the authors’ replies. I believe we have reached a point where a discussion between the AC and the reviewers is needed. Based on what I have read and my current assessment of the paper, I am keeping my score unchanged for now.

Comment

We thank the reviewer for the further comments. We always keep open to discussing with the reviewer and are happy to address any potential questions in the future.

Review (Rating: 6)

This paper proves convergence rates in online convex optimization in the setting where the gradient distribution has a heavy tail, for usual algorithms like OGD, DA, etc., i.e. without gradient clipping or extra tricks. More precisely, the assumptions that are tackled are stochastic gradients with a finite $\mathsf{p}$-th central moment, with $\mathsf{p} \in (1, 2]$. Interestingly, these bounds are optimal in all parameters, that is, they don't require the knowledge of $\mathsf{p}$, which shows that gradient clipping is not even needed to achieve this. Finally, the authors show such bounds for many variants of OGD including DA and AdaGrad, and they also tackle the setting of nonsmooth convex and nonconvex optimization with heavy tails.

Strengths and Weaknesses

Strengths

This paper's results address an almost empty part of the literature, namely the convergence of stochastic online convex optimization algorithms (which is more general than vanilla stochastic convex optimization): as they mention in the paper, one paper tackled that setting before, but using quite complex algorithms with gradient clipping and other tricks. Interestingly, this paper shows that simple (stochastic) online gradient descent algorithms are already optimal, and for instance do not need gradient clipping.

I think the proof techniques are quite neat, based on plugging lesser-known, tighter inequalities into the well-known proof for OCO and then showing convergence with the heavy-tailed noise, and I think this may be useful to the community as a classical proof to build upon.

Also, to mention a few of the features of their results, the authors extend their work to a variety of algorithms (OGD, DA, AdaGrad), and settings (nonsmooth convex and non-convex optimization with heavy tails). Notably, the result for non-smooth convex optimization with heavy tail applies not just to the average of the iterates but also to the last iterate (which is interesting since such last iterate results usually have some interesting explanatory power for optimization algorithms used in modern ML applications like LLMs, which usually consider the last iterate).

Finally, although theoretical, I think that the results proven in the paper could even be interesting for modern large language model training. Indeed, large language model training, in some respects, resembles online (stochastic) learning more than vanilla stochastic gradient descent, as training is usually over only one very long epoch of data (no data point is seen twice), and the data mixture (hence the data distribution) can be made to evolve over training. And to my knowledge, there have been many recent papers that tried to explain the success of algorithms with gradient clipping for LLMs by arguing that LLM gradient distributions are usually heavy-tailed. Therefore, this paper may make the community think deeper about whether gradient clipping is fundamentally needed, even for LLM training.

Overall I think this paper contains a lot of results useful and interesting to the community; it also feels engaging and nice to follow.

Weaknesses

I feel that the motivation for studying heavy-tailed noise gradients is well explained (the intrinsic motivation of filling a gap in the literature, studying existing algorithms more deeply, and also explaining recent experimental results), but it could perhaps be made even stronger by adding that, as mentioned above, heavy-tailed noise is often present in LLMs, LLM training is a setting which closely resembles OCO, and OCO-like results, especially last-iterate-type results, can have good explanatory power for the training behavior of LLMs, see e.g. [1].

(but maybe this would just be a distracting addition of some hand-wavy extra motivation to a paper which already has clear motivations)

[1] The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training, Shaipp et al.

Questions

In 1.2, you mention that the regret of [47] matches their lower bound, but does not match the standard $\sqrt{T}$ bound in the deterministic case: could you elaborate slightly as I am not sure I understand: does that mean that their lower bound becomes inapplicable in the case where $\sigma=0$?

Limitations

no specific limitation

Final Justification

As mentioned, I am recommending this paper as I think it provides an interesting result, since gradient clipping is often associated with dealing with heavy-tailed noise. This paper shows that this may actually not be the case (if one considers online (constrained) convex optimization). Moreover I think the paper is well written and clear, and its proofs can be interesting to the community.

Formatting Issues

no formatting issue

Author Response

We thank the reviewer for the endorsement of our work. Below we address the reviewer's concerns.

Weakness. We thank the reviewer for suggesting the potential relationship between our work and other broader areas that we were not aware of before. We are happy to read, think about them, and add some discussions in the revision if more space is allowed.

Question. To clarify, [1] only proved an upper bound $(G+\sigma)DT^{1/\mathsf{p}}$ (up to extra logarithmic factors) and did not prove any lower bound. In the deterministic case (i.e., $\sigma=0$), their upper bound is still applicable and will degenerate to $GDT^{1/\mathsf{p}}$, which is however far from the optimal lower bound $GD\sqrt{T}$ for deterministic OCO. In other words, their bound is not entirely optimal. Our discussion in Section 1.2 aims to bring this issue to the reader's attention.
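In symbols, the above comparison can be restated in one line (for $\mathsf{p}<2$, so that $T^{1/\mathsf{p}-1/2}\to\infty$ as $T\to\infty$):

$$\left.(G+\sigma)DT^{1/\mathsf{p}}\right|_{\sigma=0}=GDT^{1/\mathsf{p}}=GD\sqrt{T}\cdot T^{1/\mathsf{p}-1/2}\;\gg\;GD\sqrt{T},\qquad\text{while}\qquad\left.\left(GD\sqrt{T}+\sigma DT^{1/\mathsf{p}}\right)\right|_{\sigma=0}=GD\sqrt{T},$$

so only the latter collapses to the optimal deterministic rate.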

References.

[1] Zhang, J., & Cutkosky, A. Parameter-free regret in high probability with heavy tails. NeurIPS 2022.

Comment

Dear authors, thanks a lot for your answer, your rebuttal addresses my concerns and as such I will maintain my positive review of your paper.

Comment

We thank the reviewer for the further feedback and appreciate the supportive comment.

Final Decision

(a) Summary

This paper investigates online convex optimization (OCO) with heavy-tailed stochastic gradient noise. The key contribution of this paper lies in establishing expected regret bounds for several classic OCO algorithms, including OGD, DA and AdaGrad. These methods obtain optimal regret bounds without requiring algorithmic modifications such as gradient clipping. Furthermore, the authors apply their results to the nonsmooth convex optimization and nonsmooth nonconvex optimization settings under heavy-tailed noise.


(b) Strengths

The theoretical results demonstrate that OGD, DA and AdaGrad, without gradient clipping, can achieve optimal regret bounds in OCO with heavy-tailed noise. The authors further extend their results into nonsmooth convex optimization and nonsmooth nonconvex optimization by using online-to-batch conversion and online-to-non-convex conversion, respectively.


(c) Weaknesses

The weaknesses of this paper can be summarized below:

  • The technical contributions are limited, because the paper does not introduce new algorithms and relies on existing techniques. Most of the proofs follow standard analyses. Moreover, providing theoretical guarantees under heavy-tailed noise without gradient clipping has already been explored in prior work [1], which limits the novelty of this work.
  • The extensions to nonsmooth convex and nonsmooth nonconvex optimization do not seem to introduce significant technical challenges. For nonsmooth convex optimization, the results are obtained mainly through the online-to-batch conversion. For nonsmooth nonconvex optimization, the method design relies heavily on the online-to-non-convex conversion.
  • The claim of optimality with respect to all parameters is not fully supported, since the paper does not provide lower bounds.

(d) Reasons for rejection

The primary reason is that the technical contributions are insufficient and the core novelty is limited. The main contribution of the paper is to derive theoretical guarantees for OCO under heavy-tailed noise without gradient clipping. However, similar results have already been established in prior work. In addition, the authors do not provide a lower bound to support the optimality of their regret bounds.


(e) Discussion and rebuttal

I summarize below the discussions and changes between the four reviewers and the authors.

Reviewer pPe8's concerns focus on the presentation of the motivation for studying heavy-tailed noise gradients and on some doubts regarding the limitations of the prior work. The authors clarify both the motivation and the limitations of prior work in their response.

Reviewer crhG raises concerns regarding the bounded domain assumption. In practice, domains are often unbounded. The authors reply that although the analysis is conducted under the boundedness assumption, the methods would perform well without it. Moreover, the reviewer also notes that the paper does not explicitly address mirror descent type algorithms. The authors respond that the proofs of the work can be extended to the online mirror descent framework directly.

Reviewer gH1H's concerns center on the limited technical contributions and the presentation of the paper, including issues with the literature review, the contribution over [2], and the definition of $K$-shifting regret. This reviewer also points out that the paper does not introduce new algorithms and relies heavily on standard techniques. Moreover, the presentation of this paper is dense and difficult to follow. The authors' reply fails to address the reviewer's concerns.

Reviewer mzro questions the core contribution of the paper. They remark that establishing guarantees under heavy-tailed noise without gradient clipping has already been investigated in prior work [1], which limits the novelty of this work. In addition, although the authors claim that their regret bounds are optimal, they do not provide a lower bound. The additional analysis provided by the authors during the rebuttal period does not fully address this concern.

After the rebuttal period, Reviewers pPe8 and crhG maintain their positive scores as their concerns are addressed. In contrast, the concerns of Reviewers gH1H and mzro are not fully resolved, and they maintain their negative scores.

I place greater weight on the evaluations of Reviewer gH1H and Reviewer mzro, as their points highlight the limitations in both the significance and novelty of the work.


References

[1] Zijian Liu and Zhengyuan Zhou. Nonconvex Stochastic Optimization under Heavy-Tailed Noises: Optimal Convergence without Gradient Clipping. ICLR 2025.

[2] Ashok Cutkosky, Harsh Mehta and Francesco Orabona. Optimal Stochastic Non-smooth Non-convex Optimization through Online-to-Non-convex Conversion. ICML 2023.