PaperHub
Overall score: 6.0 / 10 · Poster · 3 reviewers (min 5, max 7, std 0.8)
Individual ratings: 7, 6, 5 · Average confidence: 4.0
Correctness: 3.3 · Contribution: 2.7 · Presentation: 3.0
NeurIPS 2024

Universal Online Convex Optimization with $1$ Projection per Round

OpenReview · PDF
Submitted: 2024-05-10 · Updated: 2024-12-19

Abstract

To address the uncertainty in function types, recent progress in online convex optimization (OCO) has spurred the development of universal algorithms that simultaneously attain minimax rates for multiple types of convex functions. However, for a $T$-round online problem, state-of-the-art methods typically conduct $O(\log T)$ projections onto the domain in each round, a process potentially time-consuming with complicated feasible sets. In this paper, inspired by the black-box reduction of Cutkosky and Orabona [2018], we employ a surrogate loss defined over simpler domains to develop universal OCO algorithms that only require $1$ projection. Embracing the framework of prediction with expert advice, we maintain a set of experts for each type of functions and aggregate their predictions via a meta-algorithm. The crux of our approach lies in a uniquely designed expert-loss for strongly convex functions, stemming from an innovative decomposition of the regret into the meta-regret and the expert-regret. Our analysis sheds new light on the surrogate loss, facilitating a rigorous examination of the discrepancy between the regret of the original loss and that of the surrogate loss, and carefully controlling meta-regret under the strong convexity condition. With only $1$ projection per round, we establish optimal regret bounds for general convex, exponentially concave, and strongly convex functions simultaneously. Furthermore, we enhance the expert-loss to exploit the smoothness property, and demonstrate that our algorithm can attain small-loss regret for multiple types of convex and smooth functions.
Keywords
Online Convex Optimization · Universal Online Learning · Projection

Reviews and Discussion

Review (Rating: 7)

This paper introduces methods for constrained OCO which automatically achieve the optimal rate without knowing in advance whether the losses are convex, strongly convex, exp-concave, or smooth, while using only 1 projection per round. This is notable because the standard approach proceeds by combining several expert algorithms with a meta-algorithm; in constrained settings these expert algorithms require implementing a potentially expensive projection. This work avoids projecting each of the expert algorithms' iterates by leveraging the constrained-to-unconstrained reduction of Cutkosky (2020).

Strengths

The paper addresses a clear and real problem that has been left unaddressed by the majority of the literature on this topic. The approach is a pretty straightforward modification of existing reductions, but uses them in a new and unexpected way.

Weaknesses

The main weakness is that the paper feels poorly factored. There is a very large number of back references to previous equations, and the paper would be very hard to read in print. To actually follow the math, it's almost necessary to read the paper with a pdf viewer which can display pop-up previews when hovering over links. I think this would be remedied by better factoring the results into lemmas and propositions.

As noted, the approach is a fairly straightforward modification of the results from Cutkosky & Orabona (2018) and Cutkosky (2020), and essentially boils down to not dropping negative terms in the analysis, and then exposing matching terms in the regret decomposition. I think this is fine overall; these considerations are missing from the literature, and this is a fitting place for them to enter the literature.

Questions

Do you think there could possibly be a more abstract way to formalize these universal algorithms? The strange thing about these universal algorithms is that a whole new algorithm seemingly needs to be devised every time one wants to incorporate a new kind of loss (e.g., MetaGrad -> Maler to handle strongly convex losses). The ideal result would more generally be a reduction which just passes the losses to each of the experts, maybe with some additional side-information, and lets them construct whatever surrogate loss they want with it. In this way there might just be one "final" paper on universal guarantees.

Limitations

Yes, the paper points out in the conclusion that the bounded domain / gradient assumption is a significant limitation that they hope to address in future work.

Author Response

Many thanks for the constructive reviews! We will revise our paper accordingly.


Q1: The main weakness is that the paper feels poorly factored.

A1: We apologize for the confusion caused by the numerous references to previous equations in this paper. Due to the page limit, we attempted to include all the key mathematical equations behind our main ideas, which made the paper difficult to follow in the printed version. We promise to polish our writing and will follow your suggestion to factor the results into lemmas and propositions.


Q2: Do you think there could possibly be a more abstract way to formalize these universal algorithms?

A2: Yes, there does exist a general framework in the studies of universal online learning that allows expert-algorithms to process the original functions [Zhang et al., 2022]. We are pleased to briefly describe the basic idea of their method here. Following the two-layer structure, the regret can be decomposed into the meta-regret and the expert-regret, that is,

$$\sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(x) = \underbrace{\sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(u_t)}_{\text{meta-regret}} + \underbrace{\sum_{t=1}^T f_t(u_t) - \sum_{t=1}^T f_t(x)}_{\text{expert-regret}}$$

where $x_t$ and $u_t$ denote the output of the meta-algorithm and of the expert in the $t$-th round. To bound the expert-regret, we can directly use existing online algorithms as the expert-algorithms that achieve optimal regret bounds for different types of convex functions, e.g., $O(\sqrt{T})$ and $O(\log T)$. To control the meta-regret, their meta-algorithm employs the linearized loss, i.e., $l_t(x)=\langle \nabla f_t(x_t), x-x_t\rangle$, to measure the performance of each expert. Furthermore, they also require the meta-algorithm to yield a second-order bound in terms of $l_t(\cdot)$. In this way, the meta-regret is small for exp-concave functions and strongly convex functions, i.e., $O(\log\log T)$, and is also tolerable for convex functions, i.e., $O(\sqrt{T})$. By incorporating existing online algorithms as experts, their approach inherits the regret of any expert designed for strongly convex functions and exp-concave functions, and also obtains minimax-optimal regret for convex functions. We feel that this framework is close to the "final" paper on universal guarantees that you mentioned.
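To make the two-layer structure above concrete, here is a minimal Python sketch of one round (our illustration, not code from Zhang et al. [2022] or from our paper). The expert interface, the exponential-weights style meta-update, and the learning rate `eta` are assumptions for illustration only; the actual meta-algorithm must provide a second-order bound in terms of $l_t(\cdot)$.

```python
import numpy as np

def universal_round(experts, weights, x_t, grad_t, eta):
    """One illustrative round of the two-layer meta-expert framework.

    experts : list of objects exposing predict() -> np.ndarray and update(grad)
    weights : np.ndarray of nonnegative expert weights summing to 1
    x_t     : current aggregated decision
    grad_t  : gradient of the original loss f_t at x_t
    eta     : meta learning rate (illustrative placeholder)
    """
    # Linearized loss l_t(x) = <grad f_t(x_t), x - x_t>, evaluated at each expert's prediction.
    preds = [e.predict() for e in experts]
    lin_losses = np.array([grad_t @ (p - x_t) for p in preds])

    # Exponential-weights style meta-update (illustrative only).
    weights = weights * np.exp(-eta * lin_losses)
    weights = weights / weights.sum()

    # Each expert updates with the gradient of the original loss.
    for e in experts:
        e.update(grad_t)

    # Next aggregated decision: weighted combination of the experts' new predictions.
    new_preds = [e.predict() for e in experts]
    x_next = sum(w * p for w, p in zip(weights, new_preds))
    return x_next, weights
```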

We acknowledge that our work indeed follows this universal framework. However, to address new problems, we have to make essential modifications to this framework. In this paper, we focus on reducing the projection complexity of universal online learning, and have made innovative technical contributions based on this framework. These include specially designed expert-losses for expert-algorithms that handle strongly convex functions, and novel theoretical analysis that advances universal algorithms and the black-box reduction.

Comment

Great! Thanks for the detailed explanation of Zhang (2022); this does indeed seem to be the kind of result I was hoping for.

I believe the authors will be able to factor the results and polish the writing between now and the camera-ready. I think these are important results and the approach is quite nice, so I've raised my score to a 7.

Comment

Dear Reviewer VeB9,

Many thanks for your kind reply! We will further enhance our paper according to your constructive reviews.

Best regards,

Authors

Review (Rating: 6)

This paper addresses the challenge of online convex optimization with unknown smoothness properties of the loss functions, which can be convex, strongly convex, or exp-concave. The authors propose an algorithm that achieves regret bounds of order $\sqrt{T}$, $\log T$, and $d \log T$ respectively, while requiring only a single projection step per round onto the original domain $\mathcal{X}$. Such projections can indeed be computationally expensive. Additionally, the authors present improved regret bounds for small losses.

Most algorithms that achieve similar adaptive regret upper bounds rely on meta-algorithms that combine experts (running ONS or OGD with surrogate losses), inspired by the MetaGrad algorithm. Typically, these algorithms necessitate $\log T$ projection steps per round (one per expert), which can be computationally burdensome. Mhammedi et al. (2019) reduced this projection cost to $O(1)$, but at the expense of a $d \log T$ regret for strongly convex losses. To overcome this, the authors introduce new surrogate losses based on a black-box reduction technique by Cutkosky et al. (2018), which reduces the constrained optimization problem on $\mathcal{X}$ to one on another domain, such as the Euclidean ball, where projections are easier.

Strengths

  • The paper is well-written and offers valuable insights into the use of surrogate losses to adapt to strong convexity or exp-concavity. It may serve as a comprehensive entry point into the extensive literature on universal OCO algorithms.
  • Despite combining various existing techniques, the results are non-trivial and required solving technical challenges, especially for the strongly convex case. The authors introduce novel negative terms in the analysis to achieve their results.
  • Experiments included in the appendix demonstrate that the computational improvements can be significant.

Weaknesses

  • The theoretical improvements may appear incremental, appealing to a niche audience. The improvement lies only in the specific case of $\log T$ regret for strongly convex losses with $O(1)$ projection steps. The primary high-level ideas in the algorithm and analysis are based on prior work.
  • The paper still relies on a meta-aggregation procedure, which, although theoretically effective, is not particularly elegant and maintains a per-round complexity of order $O(\log T)$. Achieving $O(1)$ complexity per round, however, seems highly challenging.
  • The convex rate is actually $O(\sqrt{T \log\log T})$, not $O(\sqrt{T})$ as stated in the results.

Questions

  • The algorithm requires prior knowledge of the parameters $G$ and $T$; would a simple doubling trick allow tuning these?
  • Your algorithm still requires $O(1)$ projection steps on $\mathcal{X}$ and $O(\log T)$ projection steps on $\mathcal{Y}$. Do you think that projection-free algorithms such as variants of Online Frank-Wolfe could be used instead of OGD and ONS (at the cost of a slightly worse rate) to remove all projections (or at least the $O(\log T)$ on $\mathcal{Y}$) while still being adaptive to the smoothness?

Limitations

The authors have addressed the limitations.

Author Response

Many thanks for the constructive reviews! We will revise our paper accordingly.


Q1: Would a simple doubling trick allow tuning the parameters $G$ and $T$ instead of requiring prior knowledge of them?

A1: In fact, the doubling trick enables our proposed algorithm to avoid prior knowledge of $T$, at the cost of an additional $\log T$ factor in the regret bound. However, this trick is inadequate for removing the requirement of prior knowledge of $G$, as the variability of the function gradients is unknown. Therefore, designing Lipschitz-adaptive online learning algorithms is a highly challenging task and merits further exploration in the future.
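For concreteness, here is a minimal sketch of the doubling trick mentioned above (our illustration; `make_algorithm` and its `step` interface are hypothetical): the horizon-dependent algorithm is restarted with a doubled horizon guess whenever the current guess is exhausted, which removes the need to know $T$ at the cost of the extra $\log T$ factor.

```python
def run_with_doubling(make_algorithm, loss_stream):
    """Run a horizon-dependent online algorithm without knowing T in advance."""
    k = 0
    algo = make_algorithm(horizon=2 ** k)   # phase 0 is tuned for horizon 1
    rounds_in_phase = 0
    for loss in loss_stream:
        if rounds_in_phase == 2 ** k:       # current guess exhausted: double it and restart
            k += 1
            algo = make_algorithm(horizon=2 ** k)
            rounds_in_phase = 0
        algo.step(loss)                     # one round of the underlying algorithm
        rounds_in_phase += 1
```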


Q2: Your algorithm still requires $O(1)$ projection steps on $\mathcal{X}$ and $O(\log T)$ projection steps on $\mathcal{Y}$.

A2: We would like to clarify that, in contrast to the expensive projection operations onto the complicated domain, projection operations onto the surrogate domain $\mathcal{Y}$ are much cheaper and can be considered negligible. In this work, we construct a ball $\mathcal{Y}=\{x \mid \Vert x\Vert\leq D\}$ as the surrogate domain. For the OGD algorithm, the projection of the reduced algorithm onto $\mathcal{Y}$ can be realized by a simple rescaling, i.e., $y_t = \hat{y}_t$ if $\Vert \hat{y}_t\Vert \leq D$; otherwise $y_t=\hat{y}_t \cdot \frac{D}{\Vert \hat{y}_t\Vert}$, where $\hat{y}_t$ denotes the unprojected decision and $y_t$ denotes the decision in the ball $\mathcal{Y}$. For the ONS algorithm, there exists an efficient implementation in the literature, where the time-consuming projection operation can be replaced by a more efficient singular value decomposition [Lemma 7, Mhammedi et al., 2019]. After combining all the experts' decisions via the meta-algorithm, we project the resulting point in $\mathcal{Y}$ onto $\mathcal{X}$, which is the only projection onto $\mathcal{X}$ per round.
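As a concrete illustration of how cheap the rescaling projection onto the ball $\mathcal{Y}$ is, here is a minimal sketch (our illustration, not code from the paper):

```python
import numpy as np

def project_onto_ball(y_hat, D):
    """Rescaling projection onto the ball {x : ||x|| <= D}."""
    norm = np.linalg.norm(y_hat)
    return y_hat if norm <= D else y_hat * (D / norm)

# Example: a point outside the ball of radius 1 is rescaled to its boundary.
y = project_onto_ball(np.array([3.0, 4.0]), D=1.0)   # -> array([0.6, 0.8])
```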


Q3: Do you think that projection-free algorithms such as variants of Online Frank-Wolfe could be used instead of OGD and ONS to remove all projections while still being adaptive to the smoothness?

A3: Thanks for your insightful comments! We could choose projection-free algorithms, such as variants of Online Frank-Wolfe, as the expert-algorithms. However, given the current studies on projection-free algorithms (summarized in the table below), this approach may suffer a deterioration of the regret bound and cannot handle certain cases.

More specifically, as shown in the table below, there are no suitable projection-free algorithms in the literature for exp-concave functions, nor for strongly convex and smooth functions. Moreover, when functions are smooth, existing projection-free algorithms are unable to achieve problem-dependent bounds, such as the small-loss bounds in this work.

| Algorithm | Condition on Loss | Regret Bound |
| --- | --- | --- |
| OFW [1] | convex | $O(T^{3/4})$ |
| OSPF [2] | convex and smooth | $O(T^{2/3})$ |
| SC-OFW [3] | strongly convex | $O(T^{2/3})$ |
| AFP-ONS [4] | exp-concave and smooth | $O(T^{2/3})$ |

Finally, we would like to highlight that although using projection-free algorithms can remove all projections, they may not achieve greater efficiency within the universal framework. Specifically, most projection-free algorithms, such as OFW and its variants, replace the original projection operation with a linear optimization step. Since the universal framework requires maintaining $O(\log T)$ expert-algorithms, this approach needs to perform $O(\log T)$ linear optimization steps per round, which can be time-consuming when $T$ is large.

References:

[1] Hazan and Kale. Projection-free Online Learning. ICML, 2012.

[2] Hazan and Minasyan. Faster Projection-free Online Learning. COLT, 2020.

[3] Wan et al. Projection-free Online Learning over Strongly Convex Sets. AAAI, 2021.

[4] Garber and Kretzu. Projection-free Online Exp-concave Optimization. COLT, 2023.

Comment

Thank you for your response. This addresses some of the points I raised. I have no further questions and will await the discussion with the other reviewers.

Comment

Dear Reviewer aw3E,

Thank you very much for your kind reply! We will further improve our paper.

Best regards,

Authors

Review (Rating: 5)

This paper studies universal OCO algorithms with fewer projections. Previous works either use $O(\log T)$ projections per round or have a sub-optimal dependence on $d$ for strongly-convex losses. This work designs a new surrogate loss to achieve tight regret for Lipschitz convex/exp-concave/strongly-convex losses simultaneously, with only 1 projection per round.

Strengths

The technical contributions are solid: this paper makes a strict improvement over previous results.

The paper is very well-written, clearly introducing the challenges and the main ideas. Details of the analysis and algorithm are nicely explained.

Weaknesses

The contribution seems somewhat incremental to me. The only improvement is a factor of $d$ for strongly-convex losses. Such a result is nice to know, but I'm not sure how significant it is. In addition, the technical novelty isn't significant either.

Questions

See weaknesses.

Limitations

See weaknesses.

Author Response

Many thanks for the constructive reviews. We will revise our paper accordingly.


Q1: The significance of our result.

A1: We emphasize that the $d$-dependence is significant in online learning studies. We provide the following elaborations and will also revise the paper to highlight this significance more clearly.

  • For online convex optimization, knowing the minimax optimality regarding $d$ and $T$ is a fundamental question to investigate. It is well-known that for convex/exp-concave/strongly convex functions, the minimax rates are $O(\sqrt{T})$, $O(d\log T)$, and $O(\log T)$, respectively. Therefore, in universal online learning, achieving regret bounds that match these minimax rates is also of great significance. As explicitly mentioned by the authors of MetaGrad, van Erven et al. [1], the extra factor of $d$ in the regret bound for strongly convex functions is a "real limitation". Our results achieve minimax optimality in terms of $d$ and $T$ through non-trivial technical ingredients.
  • Numerous studies in online learning are dedicated to improving regret bounds by reducing the dependence on $d$. For example, prior work on private OCO achieved an $\tilde{O}(\sqrt{dT}/\epsilon+\sqrt{T})$ regret bound [2], where $\tilde{O}(\cdot)$ hides polylogarithmic factors in $T$. In a subsequent study, this result was improved to $\tilde{O}(d^{1/4}\sqrt{T}/\sqrt{\epsilon}+\sqrt{T})$ [3], which is a significant advancement. Other examples include improving the $d$-dependence for multi-armed bandits, bandit convex optimization, etc.

Therefore, we believe that the result in this paper is significant.

References:

[1] van Erven et al. MetaGrad: Adaptation using Multiple Learning Rates in Online Learning. JMLR, 2021.

[2] Jain et al. Differentially Private Online Learning. COLT, 2012.

[3] Kairouz et al. Practical and Private (Deep) Learning Without Sampling or Shuffling. ICML, 2021.


Q2: The significance of our technical novelty.

A2: We would like to take this chance to emphasize our technical novelty.

  • Challenge: This work focuses on universal algorithms with 1 projection per round. Prior work applied the black-box reduction to an existing universal algorithm, i.e., MetaGrad, which achieves optimal regret for exp-concave and convex functions with 1 projection per round. The basic idea is to cast the original problem on the constrained domain $\mathcal{X}$ into an alternative problem for the surrogate loss on a simpler domain $\mathcal{Y}$. To handle strongly convex functions, a straightforward way is to use a universal algorithm that supports strongly convex functions, e.g., Maler, as the black-box subroutine. However, this black-box approach fails to derive a tight regret bound for strongly convex functions, because we are unable to bound the term
$$\tilde{O}\left(\sqrt{\sum_{t=1}^T \Vert y_t-x\Vert^2} -\sum_{t=1}^T \Vert x_t-x\Vert^2\right),$$

where $y_t\in\mathcal{Y}$ is the fake decision on the surrogate domain, and $x_t\in\mathcal{X}$ is the true decision on the feasible domain. The above term is unmanageable since $\Vert y_t-x\Vert\geq\Vert x_t-x\Vert$ (the true decision $x_t$ is obtained by projecting $y_t$ onto $\mathcal{X}$, and projections onto convex sets are non-expansive). To the best of our knowledge, this technical challenge is entirely new for both universal algorithms and the black-box reduction, with no prior work to draw upon.

  • Analysis: In this work, we address the above challenge in two steps. First, we introduce a novel meta-expert decomposition and, motivated by this, we specifically design a novel expert-loss. This expert-loss offers the advantage of bounding the meta-regret by the following term,
$$\tilde{O}\left(\sqrt{\sum_{t=1}^T\langle\nabla g_t(y_t),\,y_t-y_t^i\rangle^2}-\sum_{t=1}^T\Vert x_t-y_t^i\Vert^2\right).$$

Second, to bound the above difference, we dig into the properties of the surrogate loss of the black-box reduction. The technical novelty of our theoretical analysis is that the second-order bound in terms of the decision on the surrogate domain (the fake decision) can be converted into one in terms of the decision on the feasible domain (the true decision), at the cost of adding an additional positive term $\Delta_T$ (see Lemma 7 for details), that is,

$$\sum_{t=1}^T\langle\nabla g_t(y_t),\,y_t-y_t^i\rangle^2\leq O\left(\sum_{t=1}^T\langle\nabla g_t(y_t),\,x_t-y_t^i\rangle^2\right) +\Delta_T,$$

where $\Delta_T$ is defined in our paper. Furthermore, compared to previous work on the black-box reduction, we also prove a tighter connection between the original function and the surrogate loss (see Lemma 3 for details), which introduces a negative term $-\Delta_T$ that automatically offsets the cost arising from the aforementioned conversion. By combining these two bounds, we are able to control the meta-regret under the strong convexity condition, thus ensuring that it does not affect the algorithm's optimality. Therefore, we believe that the technical novelty in our theoretical analysis is valuable for both universal algorithms and the black-box reduction.

  • Applications: Two-layer algorithms are widely adopted in modern online learning, including the universal online learning studied in our paper and also non-stationary online learning. Therefore, our developed techniques can also be useful for non-stationary online learning, especially regarding strongly convex/exp-concave functions. For example, Baby and Wang [4] propose a two-layer algorithm with optimal dynamic regret for strongly convex functions, which also suffers from $O(\log T)$ projection complexity. By applying our technique to a strongly adaptive algorithm with a second-order bound, it should be possible to reduce the number of projections of their method to 1 per round.

Overall, we believe that the technical novelty of our work is significant. Thanks for your comments, and we will revise the paper to further highlight the technical novelty.

References:

[4] Baby and Wang. Optimal dynamic regret in proper online learning with strongly convex losses and beyond. AISTATS, 2022.

Comment

Thank you for your detailed response!

Given that the optimal dependence on $d$ and $T$ is already known for each sub-case, the $d$-dependence improvement concerns how we aggregate them to build universal algorithms. In my opinion, such a $d$-dependence improvement is not as significant, because the study of universal algorithms mostly remains at the theory level so far: how useful universal algorithms are in practice is in question, and there seem to be far fewer scenarios that crucially require universal algorithms than scenarios for each sub-case.

As a result, I will maintain my score.

Final Decision

The paper considers the problem of handling various function classes, like convex, exp-concave, and strongly convex, via one algorithm. This sub-area, called universal algorithms, has been studied significantly in the literature. The typical solutions run $\log T$ experts at different learning rates. The default such solution requires $\log T$ projections, one for each expert. This was resolved by Mhammedi et al., who gave an algorithm with 1 projection. That algorithm attains the right rate for convex and exp-concave functions but misses the right rate for strongly convex functions, suffering an additional factor of $d$. This is the rate that this paper improves. They do so by transferring the problem to the domain of a sphere via a construction by Cutkosky and Orabona.

The paper is pretty well written and, as the reviewers point out, it lays out the results and the techniques pretty well. Overall, the paper makes a somewhat niche theoretical improvement in my opinion (shared by other reviewers), which is not very surprising. However, since the reviewers have given the paper a high score, I am recommending acceptance.