PaperHub
Overall: 6.4/10
Poster · 4 reviewers
Ratings: 5, 4, 4, 3 (min 3, max 5, std 0.7)
Average confidence: 3.5
Novelty: 2.8 · Quality: 3.3 · Clarity: 3.0 · Significance: 3.0
NeurIPS 2025

An Ellipsoid Algorithm for Online Convex Optimization

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

A new projection-free algorithm for online convex optimization that relies on a separation oracle and does not require preconditioning of the feasible set.

Abstract

We study the problem of Online Convex Optimization (OCO) over a convex set $\mathcal{K} \subset \mathbb{R}^d$, accessed via a separation oracle. While classical projection-based algorithms such as projected Online Gradient Descent (OGD) achieve the optimal $O(\sqrt{T})$ regret, they require computing Euclidean projections onto $\mathcal{K}$ whenever an iterate falls outside the feasible set. These projections can be computationally expensive, especially for complex or high-dimensional sets. Projection-free algorithms address this by replacing projections with alternative oracle-based procedures, such as separation or linear optimization oracles. However, the regret bounds of existing separation-based methods scale poorly with the set's asphericity $\kappa$, defined as the ratio between the radii of the smallest enclosing ball and the largest inscribed ball in $\mathcal{K}$; for ill-conditioned sets, $\kappa$ can be arbitrarily large. We introduce a new separation-based algorithm for OCO that achieves a regret bound of $\tilde{O}(\sqrt{dT} + d^2)$, with only logarithmic dependence on $\kappa$. This removes a key limitation of prior work and eliminates the need for costly geometric pre-processing, such as transforming $\mathcal{K}$ into isotropic position. Our algorithm is based on a novel reduction to online optimization over a sequence of dynamically updated ellipsoids, inspired by the classical ellipsoid method for convex optimization. It requires only $\tilde{O}(1)$ separation oracle calls per round, on par with existing separation-based approaches. These advances make our method particularly well suited for online optimization over geometrically complex feasible sets.
Keywords
Online Convex Optimization · Projection-Free · Ellipsoid Method · Separation Oracle

Reviews and Discussion

Review
Rating: 5

The paper considers the problem of developing efficient projection-free algorithms for online convex optimization (OCO). More formally, at each time $t$, the learner (an online algorithm) chooses $w_t \in \mathcal{K}$ for some convex set $\mathcal{K}$, the adversary chooses a convex loss function $f_t$ and reveals a subgradient $g_t \in \partial f_t(w_t)$, and the learner incurs loss $f_t(w_t)$. The goal of the learner is to minimize the regret $\sum_{t=1}^{T} f_t(w_t) - \inf_{w \in \mathcal{K}} \sum_{t=1}^{T} f_t(w)$. The optimal regret for the problem is $\Theta(\sqrt{T})$, achieved via the projected-OGD algorithm. However, OGD requires projections onto $\mathcal{K}$ (whenever the iterate falls outside $\mathcal{K}$), which can be prohibitively expensive. Projection-free methods aim to alleviate this issue by assuming access to linear-optimization or separation-based oracles. The current paper expands the line of work on separation-based oracles for OCO and proposes an algorithm that achieves a regret bound of $\tilde{\mathcal{O}}(\sqrt{Td} + d^2)$, thereby improving the polynomial dependence on $\kappa$ ($\mathcal{K}$'s asphericity parameter, defined as the ratio of the radii of the smallest circumscribed ball and the largest inscribed ball) in the $\mathcal{O}(\kappa \sqrt{T})$ and $\mathcal{O}(\sqrt{Td} + \kappa d)$ regret bounds that appeared in prior works. The key novelty in the paper is the use of the ellipsoid method and a reduction to online convex optimization over a sequence of dynamically changing ellipsoids that always contain $\mathcal{K}$ and progressively approximate it well.
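To make the protocol concrete, here is a minimal sketch of the projected-OGD baseline described above (the helper names are hypothetical; the `project` callable is the per-round Euclidean projection that projection-free methods avoid):

```python
import numpy as np

def projected_ogd(project, grad_oracle, w0, T, eta):
    """Projected online gradient descent: the classical O(sqrt(T))-regret
    baseline. The per-round call to `project` is the expensive step that
    projection-free methods replace with separation-oracle calls."""
    w = np.asarray(w0, dtype=float)
    iterates = []
    for t in range(T):
        iterates.append(w)
        g = grad_oracle(t, w)            # subgradient g_t in df_t(w_t)
        w = project(w - eta * g)         # Euclidean projection onto K
    return iterates

# For K = unit ball the projection is a cheap rescaling; for a general K
# (accessed only through a separation oracle) it is not.
project_unit_ball = lambda x: x / max(1.0, float(np.linalg.norm(x)))
```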

Strengths and Weaknesses

Strengths: The paper considers a fundamental limitation of existing projection-free, separation-oracle-based methods: they suffer from a polynomial dependence on $\kappa$, which can be arbitrarily large for ill-conditioned sets. The paper manages to completely remove this polynomial dependence (only a logarithmic dependence on $\kappa$ remains) via a novel method. Remarkably, the proposed algorithm requires the same number of calls ($\tilde{\mathcal{O}}(1)$) to the separation oracle as the existing methods. The paper explains the limitations of existing methods that achieved the $\mathcal{O}(\kappa \sqrt{T})$ and $\mathcal{O}(\sqrt{Td} + \kappa d)$ bounds very clearly. The proposed algorithm is built upon these limitations, and Sections 3 and 4 are very well written. The algorithm, although technical, can be understood via the main ideas presented in Sections 3 and 4.1.

Weaknesses: The authors acknowledge in the limitations that the paper does not have experiments to support the theoretical claims. However, I do not consider this a major weakness of the paper. The proposed algorithm has solid theoretical backing, and the lack of experimental evidence is a very minor issue.

Overall, my assessment of the paper is towards acceptance.

Questions

(1) In Definition 2.1, I believe the correct condition for Item II should be $\langle v, u \rangle > \langle v, w \rangle$ for all $w \in \mathcal{K}$, isn't it?

(2) For the most-relevant prior work [17], what is the computation cost per round? Do they also incur an $\mathcal{O}(d^{\omega})$ cost per round? My guess is not, but it would be great if the authors could clarify.

Limitations

Yes

Final Justification

I did not have any concerns and continue to vote for acceptance.

Formatting Concerns

No Concerns.

Author Response

We are grateful for your time and for your thoughtful evaluation of our work.

In Definition 2.1, I believe the correct condition for Item II should be $\langle v, u \rangle > \langle v, w \rangle$ for all $w \in \mathcal{K}$, isn't it?

We will fix this typo.

For the most-relevant prior work [17], what is the computation cost per round? Do they also incur an $\mathcal{O}(d^{\omega})$ cost per round? My guess is not, but it would be great if the authors could clarify.

The per-round cost of [17] is $\tilde{O}(d^2 + C_{\text{separation}})$, which can be slightly lower than our cost of $\tilde{O}(d^\omega + C_{\text{separation}})$, where $C_{\text{separation}}$ is the cost of performing separation. However, since the cost of performing separation typically dominates the cost of solving a $d \times d$ linear system (which we upper bound by $d^\omega$ in our analysis), the overall complexity of our method can be considered comparable to that of [17], or slightly higher.

Comment

I thank the authors for their response. I maintain my positive evaluation of the work.

Review
Rating: 4

This paper proposes a projection-free method for OCO that accesses the feasible set through calls to a separation oracle. The method has the advantage that the regret depends on the asphericity $\kappa = R/r$ only logarithmically.

Strengths and Weaknesses

Strengths:

  1. The approach has the clear advantage that the dependence of the regret on the asphericity is only logarithmic. This has not been achieved by existing methods.
  2. The paper clearly states how its analysis methods differ from the related work [17]. In particular, the use of ellipsoids rather than balls for the projection is a clear innovation.
  3. In general, the paper is well written.

Weaknesses:

  1. The paper builds closely on [17], with fairly minor improvements in guarantees.
  2. The computation cost per iteration depends on the term $d^{\omega}$. It appears that this could result in significantly higher computation cost than existing methods. As such, the regret-per-computation of this method might not be better than that of existing methods even if the regret is lower for a given $T$. Furthermore, it is unclear if the computation cost is less than even projection-based methods (e.g., OGD, RFTL).
  3. Table 1 (which compares the guarantees with existing work) is somewhat confusing in that it shows the number of oracle calls per round but doesn't compare the computation cost. This is important given that the motivation for projection-free OCO is to reduce the computation cost of classical methods.

Questions

  1. What is the term $\omega$ in the computation cost term $d^{\omega}$?
  2. How does the computation cost of the method compare with the following: (a) Euclidean projection in every round, (b) putting the set in isotropic position once, (c) existing projection-free methods?

Limitations

yes

Final Justification

I appreciate the authors' efforts in the extensive discussion. This discussion answered the clarifying questions that I had.

I am maintaining my current score. I believe that this score is appropriate because the paper has (what I see to be) a moderate methodological contribution in using an ellipsoid instead of a ball in the algorithm of [17] and developing an associated analysis. At the same time, it is less clear how significant the performance guarantees (e.g., regret, computation cost) are, as it seems that existing methods can provide better guarantees in many regimes.

Formatting Concerns

none

Author Response

We appreciate your careful review and the time you spent on our submission.

The paper builds closely on [17], with fairly-minor improvements in guarantees.

The regret bound in [17] scales linearly with the asphericity, which can be arbitrarily large. In contrast, our bound moves this dependence inside a logarithmic factor, which we view as a significant improvement over the guarantee in [17].

The computation cost per iteration depends on the term $d^{\omega}$. It appears that this could result in significantly higher computation cost than existing methods. As such, the regret-per-computation of this method might not be better than that of existing methods even if the regret is lower for a given $T$. Furthermore, it is unclear if the computation cost is less than even projection-based methods (e.g., OGD, RFTL).

The $d^\omega$ term in our computational cost comes from inverting a matrix (or solving a $d \times d$ linear system), which is required by the PoE subroutine for projecting onto the ellipsoid. Here, $\omega$ is the matrix multiplication exponent, currently estimated to be less than $2.371552$. The per-round cost of our algorithm is on the order of $d^\omega + C_{\text{separation}}$, where $C_{\text{separation}}$ is the cost of performing separation. This can be slightly higher than the per-round cost of the algorithm in [17], which is $d^2 + C_{\text{separation}}$. However, in typical optimization settings, the separation cost $C_{\text{separation}}$ dominates $d^\omega$ (the cost of solving a $d \times d$ linear system), in which case the overall computational cost of our method would be comparable to that of [17].
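For intuition only (a generic sketch, not the paper's actual PoE subroutine): the following shows how mapping a point onto an ellipsoid $\mathcal{E} = \{z \mid (z - c)^\top A^{-1} (z - c) \leq 1\}$ can reduce to a single $d \times d$ linear solve, which is where a $d^\omega$-type cost enters.

```python
import numpy as np

def rescale_onto_ellipsoid(x, c, A):
    """Pull x onto E = {z : (z-c)^T A^{-1} (z-c) <= 1} along the ray from
    the center c. The quadratic form is evaluated with one d x d linear
    solve (no explicit inverse) -- the d^omega-type step discussed above."""
    v = x - c
    q = float(v @ np.linalg.solve(A, v))   # v^T A^{-1} v
    if q <= 1.0:
        return x                           # already inside the ellipsoid
    return c + v / np.sqrt(q)              # boundary point of E
```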

As for the comparison to Euclidean projection: if we could reduce the cost of Euclidean projection onto an arbitrary set to that of solving a linear system, it would represent a major breakthrough. In general, the cost of Euclidean projection can be much higher than the cost of solving a $d \times d$ linear system (i.e., $d^\omega$). In fact, given access to a separation oracle for $\mathcal{K}$, the cost of Euclidean projection onto $\mathcal{K}$ using state-of-the-art cutting-plane methods is on the order of $\tilde{O}(d^3 + d \cdot C_{\text{separation}})$, which is clearly worse than $d^\omega + C_{\text{separation}}$. Moreover, the complexity of cutting-plane methods often hides large multiplicative constants in the $\tilde{O}$ notation.

Table 1 (which compares the guarantees with existing work) is somewhat confusing in that it shows the number of oracle calls per round but doesn't compare the computation cost. This is important given that the motivation for projection-free OCO is to reduce the computation cost of classical methods.

As mentioned earlier, the separation cost typically dominates the $d^\omega$ term, which is why the table focuses on the separation oracle complexity. We will add the other computational costs to the table for completeness.

What is the term $\omega$ in the computation cost term $d^\omega$?

$\omega$ is the matrix multiplication exponent and is currently estimated to be less than $2.371552$. In our case, we use $d^\omega$ as an upper bound on the cost of solving a $d \times d$ linear system.

How does the computation cost of the method compare with the following: (a) Euclidean projection in every round, (b) putting the set in isotropic position once, (c) existing projection-free methods?

Cost of Euclidean projection: $\tilde{O}(d^3 + d \cdot C_{\mathrm{separation}})$ (see the Lee et al., 2018 reference below). This cost is achieved using state-of-the-art cutting-plane methods. We note that the $\tilde{O}$ notation can hide large constants, which often makes such methods practical only for small- to medium-sized problems. Alternatively, one could use an interior-point method for Euclidean projection, but this requires a self-concordant barrier for the feasible set, which is not always available. Even when such a barrier exists, the typical cost remains greater than $d^3$.

Putting the set in isotropic position: $\tilde{O}(d^4) \cdot C_{\text{separation}}$ (see [14]). This can be prohibitive, even if performed only once. Moreover, placing the set in isotropic position does not necessarily eliminate the dependence on asphericity in the regret bound; see the paragraph titled “Limitations of reparametrization” on Line 194 and the response to Reviewer h8Ps.

Per-round cost of existing projection-free methods: $\tilde{O}(d + C_{\text{separation}})$ for [16], and $\tilde{O}(d^2 + C_{\mathrm{separation}})$ for [17]. The dependence on $\kappa$ is linear in the regret bounds of both [16] and [17]; in [16], for example, $\kappa$ multiplies $\sqrt{T}$. Since the cost of separation typically dominates the cost of solving a $d \times d$ linear system (i.e., $d^\omega$), the overall complexity of our method can be considered comparable to that of [17], or slightly higher.

Per-round cost of our approach: $\tilde{O}(d^\omega + C_{\text{separation}})$; see the answer to the second comment above for a comparison with [17].

Lee, Y. T., Sidford, A., & Vempala, S. S. (2018, July). Efficient convex optimization with membership oracles. In Conference On Learning Theory (pp. 1292-1294). PMLR.

Comment

Thank you for this detailed explanation.

You mentioned that, in practice, $C_{\text{sep}}$ dominates in the term $C_{\text{sep}} + d^\omega$, and therefore that the computation cost is comparable to [17]. Is there a reason why this would be the case, or typical examples you can point me to?

Comment

Perhaps the most relevant setting---and in some sense the best-case scenario---for which $C_{\text{sep}}$ is relatively small is when $\mathcal{K}$ is a polytope defined by $m \in \mathbb{N}$ linear constraints: $\mathcal{K} := \{x \in \mathbb{R}^d \mid a_i^\top x \leq b_i,\ \forall i \in [m]\}$, where $(a_i) \subset \mathbb{R}^d$ and $(b_i) \subset \mathbb{R}$. This is one of the most common setups in convex optimisation.

Separation cost.
In this case, separation reduces to membership testing, which requires computing the $m$ inner products $a_i^\top x$ for $i \in [m]$. This gives $C_{\text{sep}} = O(md)$. As soon as $m \geq d^{\omega - 1} \approx d^{1.37}$, the cost of separation dominates the cost of solving a $d \times d$ linear system, which is $O(d^\omega)$.
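A minimal sketch of such a polytope separation oracle, matching the cost accounting above (the $m$ inner products are the $C_{\text{sep}} = O(md)$ term; the return convention follows the corrected Definition 2.1 shared later in this discussion):

```python
import numpy as np

def polytope_separation(x, A, b):
    """Separation oracle for K = {x : A x <= b} via membership testing.
    Computes the m inner products a_i^T x (cost O(md)); returns (0, 0) if
    x is in K, else (1, v) with v a normalized violated constraint, so
    that <v, x> > <v, w> for every w in K."""
    slack = A @ x - b                      # m inner products
    i = int(np.argmax(slack))
    if slack[i] <= 0.0:
        return 0, np.zeros_like(x)
    return 1, A[i] / np.linalg.norm(A[i])
```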

Projection cost. As stated in the rebuttal, projection using state-of-the-art cutting-plane methods costs $\tilde{O}(d \cdot C_{\text{sep}} + d^3) = \tilde{O}(md^2 + d^3)$. However, in the polytope setting, it is more common to use an interior-point method, since a self-concordant barrier can be constructed easily: $\phi(x) = -\sum_{i=1}^m \log(b_i - a_i^\top x)$, which is $m$-self-concordant. Using this barrier, the projection cost becomes $\tilde{O}(\sqrt{m} \cdot d^\omega)$.

Side note. The polytope setting is also one where our approach is more efficient than projection-free methods based on Frank–Wolfe, which require solving a linear optimisation problem at each round. In this case, the cost of linear optimisation is comparable to that of projection using an interior-point method.

Improvements after rebuttal. We will incorporate the additional $\text{poly}(d)$ costs into Table 1 in the final version. We will also include the clarifications raised in this rebuttal regarding the cost of the different oracles.

Comment

I thank the authors for the response above, but it seems to me that it is not completely revealing.

One measure that I think is interesting when considering the efficiency of online algorithms, and that makes it possible to compare them to projection-based algorithms, is the overall runtime to reach $\epsilon$-average regret.

If I am not mistaken, per the discussion on polytopes, the overall runtime of your method to reach $\epsilon$-average regret is at least $\frac{md^2}{\epsilon^2} + \frac{d^{\omega+1}}{\epsilon^2}$. This can be compared against running a trivial benchmark of online gradient descent with an interior-point method to solve the projections, which will take $\frac{\sqrt{m}\, d^{\omega}}{\epsilon^2}$. Comparing these bounds, if I am not mistaken, shows that there is no regime of $d$ for which your method is faster.

This should not at all mean that the result is not interesting to me, since the runtime per iteration can matter, but I think a more transparent discussion regarding the limitations of this approach and fairer comparisons to other projection-free methods are in order.

Also, the comment that linear optimization oracle-based methods are useless here is misleading. No one will use these unless the polytope has a very specific structure (e.g., a well-studied combinatorial structure), and this is well established in the literature on so-called projection-free methods.

Comment

The overall runtime to reach an $\epsilon$-average regret is indeed a good metric. As I will explain below, the cost of gradient descent combined with an interior-point method for projections can still exceed that of our proposed method when the number of constraints $m$ is larger than $d$.

Full cost of projection. The cost of Euclidean projection using an interior-point method (IPM) is actually $\sqrt{m} \cdot (d^{\omega} + md^2)$; here, $\sqrt{m}$ is the number of iterations of the IPM, and $d^{\omega} + md^2$ is the per-iteration cost. The $O(md^2)$ term that I omitted in my previous comment arises from computing the Hessian of the barrier, which has $m$ terms. (The $d^\omega$ term is for inverting the Hessian.)
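As a rough sketch of where that $O(md^2)$ term comes from (our illustration, for the log-barrier $\phi$ given earlier in this thread): assembling the Hessian sums $m$ rank-one matrices.

```python
import numpy as np

def log_barrier_hessian(x, A, b):
    """Hessian of phi(x) = -sum_i log(b_i - a_i^T x), namely
    H = sum_i a_i a_i^T / (b_i - a_i^T x)^2: m rank-one terms, i.e.
    O(m d^2) work per IPM iteration, on top of the d^omega solve with H."""
    s = b - A @ x                          # slacks; x assumed strictly feasible
    As = A / s[:, None]                    # row i is a_i / s_i
    return As.T @ As                       # sum_i a_i a_i^T / s_i^2
```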

Cost of GD + IPM. With this corrected projection cost, the total runtime of gradient descent with an interior-point method is $\frac{\sqrt{m}\,(d^{\omega} + md^2)}{\epsilon^2}$.

Comparison with the proposed approach. This cost exceeds that of our method, which is $\frac{md^2}{\epsilon^2} + \frac{d^{\omega + 1}}{\epsilon^2}$, as soon as $m \geq d$.
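To spell out this comparison (our quick check, using only the displayed bounds and $\omega + 1 < 3.5$): for $m \geq d$,

$$\sqrt{m}\,(d^{\omega} + m d^{2}) \;\geq\; m^{3/2} d^{2} \;\geq\; \max\!\big(\sqrt{d}\, m d^{2},\; d^{7/2}\big),$$

and $d^{7/2} \geq d^{\omega + 1}$, so the GD + IPM runtime dominates both terms of the proposed method's bound, with a $\sqrt{d}$ slack on the $m d^{2}$ term.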

On projection-free approaches based on linear optimization. Projection-free optimization approaches based on linear optimization are indeed very useful and practical—their dimension-free regret guarantees are particularly appealing. My earlier comment was not meant to suggest they are useless. We were simply pointing out a popular setting where, in the absence of additional structure, our approach can be more competitive. Of course, there are also settings where Frank–Wolfe outperforms ours. We view these approaches as complementary.

A more transparent comparison. As mentioned in the previous comment, we will include these discussions in the paper for a more transparent comparison. This should be relatively straightforward and would not require any major changes.

Comment

Thank you for the clarification. Still, OGD + solving the projections with the SOTA cutting-plane method you mentioned above dominates in the running time to reach $\epsilon$-average regret. Is this correct?

This leads me to the point: to get your complexity, couldn't I in principle partition the rounds into blocks of length $d$ and simply simulate the above cutting-plane method within each block?

Comment

Thank you for the follow-up questions:

OGD + SOTA cutting plane for projection. Indeed, in this case, the overall cost can be slightly better than that of our approach. However, we note the following points:

  • The more standard Vaidya's cutting-plane method has complexity $\tilde{O}(d \cdot C_{\text{sep}} + d^4)$ (see [1] below). Achieving a $d^3$ dependence instead of $d^4$ requires a much more involved algorithm and does not necessarily yield a practical method (see [2] below). (Cutting-plane methods are typically more theoretical in nature and often do not scale well in practice.)
  • The $\tilde{O}$ in the complexity of cutting-plane methods typically hides large constants, which could partly explain why they often do not scale as well as one might expect.

Side note. Since the submission, we have found a way to improve the $d^\omega$ term in the analysis to $d^2$, which would bring the runtime for achieving $\epsilon$-average regret in line with that of OGD + state-of-the-art cutting-plane methods in the polytope setting. That said, we believe the results in the paper are already meaningful and valuable even with the $d^\omega$ cost.

Block-wise approach. Suppose we adopt a block-wise approach where the cost of projection via cutting-plane methods is amortized over a block of $B$ rounds (to be determined below).

  • Since we now operate over blocks of size $B$, the per-round losses are effectively scaled by $B$, which appears linearly in the regret. The resulting regret bound takes the form $O(B G R \sqrt{T/B})$.
  • The amortized cost per iteration is $\frac{C_{\text{cutting plane}}}{B} = \frac{d \cdot C_{\text{sep}} + d^3}{B}$. To match the $\tilde{O}(1) \cdot C_{\text{sep}}$ cost per round of our approach, we would need to set $B = d$. This would lead to an overall regret bound of $O(G R \sqrt{d T})$, which is in line with our guarantee (see the arithmetic spelled out below). So the question ultimately becomes whether the cutting-plane approach is a practical one (see the bullet points above).
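For concreteness, the blocking arithmetic behind these two bullets, using only the quantities above:

$$O\!\big(B\, G R \sqrt{T/B}\big) \;=\; O\!\big(G R \sqrt{B T}\big) \;\overset{B = d}{=}\; O\!\big(G R \sqrt{d T}\big), \qquad \left.\frac{d \cdot C_{\text{sep}} + d^{3}}{B}\right|_{B = d} \;=\; C_{\text{sep}} + d^{2}.$$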

References.

[1] Bubeck, Sébastien. "Convex optimization: Algorithms and complexity." Foundations and Trends® in Machine Learning 8.3-4 (2015): 231-357.

[2] Lee, Yin Tat, Aaron Sidford, and Sam Chiu-wai Wong. "A faster cutting plane method and its implications for combinatorial and convex optimization." 2015 IEEE 56th Annual Symposium on Foundations of Computer Science. IEEE, 2015.

Comment

Dear reviewer/AC,

Did we address all your questions? As the discussion period deadline approaches, please let us know if you have any additional questions or concerns we can address.

Review
Rating: 4

The paper proposes a novel projection-free algorithm for Online Convex Optimization (OCO) that relies solely on a separation oracle. Prior projection-free methods based on separation oracles suffer from poor dependence on the asphericity $\kappa$ of the feasible set. This work introduces an ellipsoid-based reduction combined with an ONS-type inner algorithm, achieving a regret of $\tilde{\mathcal{O}}(\sqrt{dT} + d^2)$ with only logarithmic dependence on $\kappa$, while preserving per-iteration oracle complexity. It significantly improves over previous methods in settings with ill-conditioned feasible sets.

Strengths and Weaknesses

Strengths

  • Improved Regret Bound: Achieves better dependence on the asphericity $\kappa$, improving practicality for ill-conditioned problems.

  • Methodical Analysis: The paper provides detailed and rigorous proofs, including algorithmic design, update rules, and complexity.

  • Conceptual Clarity: The use of Gauge projections and separation oracles is clearly explained and builds logically on prior work.

Weaknesses

  • Computational Overhead of Ellipsoid Updates: These updates involve operations with per-round complexity of $\mathcal{O}(d^\omega)$, where $\omega$ is the matrix multiplication exponent. Is there any estimate for $\omega$? Would it influence the overall computational complexity?

  • Lack of Empirical Validation: The paper focuses entirely on theoretical contributions and does not include any empirical results. While the regret guarantees are strong, it is unclear how the algorithm performs in practice. Empirical validation—such as experiments on standard OCO benchmarks or simulations comparing runtime and regret with other projection-free methods—would significantly strengthen the paper.

  • Minor issue: It seems that you forgot to mention the full name of ONS.

Questions

Please see above.

Limitations

Please see above.

Final Justification

I think this is a good paper and I would like to keep my score.

Formatting Concerns

N/A

Author Response

Thank you for your comments and for reviewing our paper.

Computational Overhead of Ellipsoid Updates: These updates involve operations with per-round complexity of $\mathcal{O}(d^\omega)$, where $\omega$ is the matrix multiplication exponent. Is there any estimate for $\omega$? Would it influence the overall computational complexity?

The $d^\omega$ term in our computational cost arises from the cost of inverting a matrix (or solving a $d \times d$ linear system). This operation is required by the PoE subroutine to project onto the ellipsoid. The exponent $\omega$ is currently estimated to be less than $2.371552$. However, even if we take $\omega = 3$, the cost $d^\omega$---that is, the cost of solving a $d \times d$ linear system---remains much smaller than the cost of Euclidean projection for many feasible sets of interest.

Lack of Empirical Validation: The paper focuses entirely on theoretical contributions and does not include any empirical results. While the regret guarantees are strong, it is unclear how the algorithm performs in practice. Empirical validation—such as experiments on standard OCO benchmarks or simulations comparing runtime and regret with other projection-free methods—would significantly strengthen the paper.

We have acknowledged the lack of experiments in the limitations paragraph. Nevertheless, we believe the theoretical results we provide stand on their own.

Minor issue: It seems that you forgot to mention the full name of ONS.

We will address this. Thank you for pointing this out.

Review
Rating: 3

This paper proposes a novel separation-based algorithm for online convex optimization (OCO) that improves the shape-dependence of the regret on the action set. The separation-based algorithms in prior works were proposed to mitigate the computational complexity of the projection steps of gradient-based methods; however, they suffer from regrets that scale polynomially with the asphericity of the action set. The algorithm proposed in this work achieves a regret bound that depends only polylogarithmically on the asphericity, albeit with a worse dimension dependence.

Strengths and Weaknesses

Strengths: OCO is a central problem in online learning, and improvements to its regret bounds under computational constraints are impactful. The algorithm proposed in this work achieves improved performance compared to the best known separation-based algorithms when the shape of the action set is highly skewed. The proposed algorithm is explained in detail, with clear update conditions and intuitions, and its performance is analyzed and adequately compared to prior works.

Weaknesses:

  1. While the regret achieved by the proposed algorithm is order-wise larger than that of the sample-optimal gradient-based algorithms, it can also be order-wise larger than that of separation-based algorithms from prior works in certain regimes, e.g., when the dimension $d$ is large, due to the additional $d$ factors in the stated upper bounds.
  2. The assumptions and prior results could be better clarified. For instance, there is a typo in the second bullet point of Definition 2.1 that makes the inequality always invalid and the meaning of the separation oracle totally unclear. It would also be helpful to clarify the dependence on $R$ and $G$ of the regrets achieved in prior works, to ensure the comparison is appropriate.

Questions

  1. As elaborated in the weaknesses section, we recommend that the authors revise the manuscript to correct the typos in the problem formulation and clarify the dependencies on all parameters when comparing regrets with related works.

  2. The authors present a natural extension to stochastic optimization on page 9; however, it is not clear what oracle model is considered here. We recommend that the authors clarify the oracle model, as well as provide a brief overview of the best known results in different settings (e.g., zeroth order, first order, or having access to both).

  3. The asphericity of the action set is often not a fundamental constraint, in the sense that a hyperellipsoid action set could have large asphericity, yet it is trivial to achieve the optimal $O(\sqrt{T})$ regret by operating in a linearly transformed space where the action set is a hyperball while the constraint parameters $G$ and $R$ remain the same. Could the authors provide an example where the dependence on asphericity cannot be resolved by extending [16], [17] through this approach?

Limitations

yes

Final Justification

While the authors promised to revise the manuscript to correct the typos, the paper could benefit from examples where high asphericity imposes a fundamental lower limit on the required sample complexity. The reviewer believes that a simple ellipsoid action set wouldn't serve as such a hard instance, since it can be resolved with the linear-transformation-based approach. The authors wanted to defend this example as a hard instance, but it appears that there is a mismatch in their comparison between the outer radii of the two action sets.

Overall, the topic of the paper is interesting, but, perhaps similar to the point raised by the AC, it would be beneficial if this result could be accompanied by some converse arguments (even informal ones) to demonstrate why the proposed rates cannot easily be improved. Therefore, my score is unchanged. However, the topic investigated in this work is interesting enough, so I won't have a strong opinion against its acceptance.

Formatting Concerns

NA

Author Response

Thank you for taking the time to review our work and provide your feedback.

While the regret achieved by the proposed algorithm is order-wise larger than that of the sample-optimal gradient-based algorithms, it can also be order-wise larger than that of separation-based algorithms from prior works in certain regimes, e.g., when the dimension $d$ is large, due to the additional $d$ factors in the stated upper bounds.

There are regimes where our regret bound is worse than that of previous separation-based algorithms, such as when $\kappa$ is already small. For this reason, our approach can be seen as complementary to existing methods.

As a side note, it is possible to design an algorithm that achieves a regret matching the best of the bounds obtained by our method and those in [16] and [17]. One such approach would aggregate the outputs of the individual algorithms using exponential weights; see, for example, reference [1] below.

As elaborated in the weaknesses section, we recommend that the authors revise the manuscript to correct the typos in the problem formulation and clarify the dependencies on all parameters when comparing regrets with related works.

We will correct the typo in the definition of the separation oracle. All the regret bounds---both ours and those we cite---depend linearly on the product $RG$, which is why we omitted it. This is generally the case for OCO algorithms, as their outputs are typically equivariant to the input scales $R$ and $G$. The exception is algorithms that adapt to $R$ and $G$, which is beyond the scope of this paper.

The authors present a natural extension to stochastic optimization on page 9; however, it is not clear what oracle model is considered here. We recommend that the authors clarify the oracle model, as well as provide a brief overview of the best known results in different settings (e.g., zeroth order, first order, or having access to both).

We consider the standard stochastic optimization setting with access to a first-order stochastic gradient oracle. Some details specific to the stochastic setting were moved to Appendix F due to space constraints. We will restore more of this material to the main text, thanks to the additional page allowed for the camera-ready version.

The asphericity of the action set is often not a fundamental constraint, in the sense that a hyperellipsoid action set could have large asphericity, yet it is trivial to achieve the optimal $O(\sqrt{T})$ regret by operating in a linearly transformed space where the action set is a hyperball while the constraint parameters $G$ and $R$ remain the same. Could the authors provide an example where the dependence on asphericity cannot be resolved by extending [16], [17] through this approach?

This is not always true, and here is why. Let $W \in \mathbb{R}^{d \times d}$ be a positive-definite matrix, and let $\mathcal{E} = \{x \in \mathbb{R}^d \mid x^\top W^{-1} x \leq 1\}$ be an ellipsoid. Assume that $\lambda_{\min}(W) \geq 1$, so that $\mathbb{B}(1) \subseteq \mathcal{E}$. Suppose we are in the OCO setting, where $\mathcal{E}$ is the feasible set and the losses $f_t$ are $G$-Lipschitz over $\mathcal{E}$; that is, $\|\nabla f_t(x)\| \leq G$ for all $x \in \mathcal{E}$.

Now, if we apply a reparametrization to work on the Euclidean ball instead, the reparametrized losses we work with are $\tilde{f}_t: \mathbb{B}(1) \rightarrow \mathbb{R}$ given by $\tilde{f}_t(y) = f_t(W^{1/2} y)$. By the chain rule, the gradients of $\tilde{f}_t$ satisfy $\nabla \tilde{f}_t(y) = W^{1/2} \nabla f_t(W^{1/2} y)$ (assuming $f_t$ is differentiable for simplicity). Therefore, the Lipschitz constant of $\tilde{f}_t$ in the reparametrized problem can be as large as $\sqrt{\lambda_{\max}(W)}\, G$, which can be arbitrarily larger than $G$ for ill-conditioned $\mathcal{E}$. Here, $\sqrt{\lambda_{\max}(W)}$ is a measure of the asphericity of $\mathcal{E}$.
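A small numeric check of this blow-up, on a hypothetical instance of our choosing (diagonal $W$, linear losses):

```python
import numpy as np

# Linear loss f(x) = g^T x with ||g|| = G = 1 is 1-Lipschitz on E, but the
# reparametrized loss f~(y) = f(W^{1/2} y) has gradient W^{1/2} g.
d, M = 5, 1e6
W = np.diag([M] + [1.0] * (d - 1))   # ill-conditioned: lambda_max(W) = M
g = np.zeros(d); g[0] = 1.0          # ||g|| = 1
grad_tilde = np.sqrt(W) @ g          # W^{1/2} g (elementwise sqrt: W diagonal)
print(np.linalg.norm(grad_tilde))    # 1000.0 = sqrt(lambda_max(W)) * G
```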

Since the regret bound scales linearly with the Lipschitz constant of the losses (see the answer to the first question), working with the reparametrized losses in this way can introduce a dependence on the asphericity in the regret bound. This issue with reparametrization holds more broadly (beyond the ellipsoid setting), as we discuss in the paragraph titled “Limitations of reparametrization” on Line 194.

[1] Dylan J Foster, Satyen Kale, Mehryar Mohri, and Karthik Sridharan. Parameter-free online learning via model selection. In Advances in Neural Information Processing Systems, pages 6020–6030, 2017.

Comment

We thank the authors for their response. Two quick follow-up questions:

  1. While the authors noted plans to correct the typos, it would be appreciated if the revised version of Definition 2.1 could be shared in the discussion as well. Although the intended meaning can be inferred, confirming the corrected form would be helpful, given its central role in the paper.

  2. Regarding the clarification of the hyperellipsoid example provided by the authors, we would like to confirm that this analysis does not directly induce a necessary dependence of the loss on asphericity. To see why this is the case, we can always transform the action set with $(W/\lambda_{\max}(W))^{1/2}$, which only increases the lengths of the shorter axes of the hyperellipsoid. This would not violate the existing constraints on both $G$ and $R$.

Comment

Thank you for your follow-up questions.

Here is the corrected definition of the separation oracle.

Definition 2.1. A separation oracle $\mathrm{Sep}_{\mathcal{C}}$ for a set $\mathcal{C}$ is an oracle that, given $u \in \mathbb{R}^d$, returns a pair $(b, v) \in \{0,1\} \times \mathbb{B}(1)$ (where $\mathbb{B}(1)$ is the unit Euclidean ball in $\mathbb{R}^d$), such that:

  • $b = 0$ and $v = 0$ if $u \in \mathcal{C}$; otherwise,
  • $b = 1$ and $\langle v, u \rangle > \langle v, w \rangle$ for all $w \in \mathcal{C}$.

The hyperellipsoid case.

If we understand correctly, you are suggesting working with the reparametrized function $\bar{f}_t(y) = f_t((W/\lambda_{\max}(W))^{1/2} y)$ instead of the one we originally proposed, $\tilde{f}_t(y) = f_t(W^{1/2} y)$. In this case, the norm of the gradient of $\bar{f}_t$ would indeed be at most $G$, which is great. However, the domain of $\bar{f}_t$ would now be $\bar{\mathcal{K}} = \lambda_{\max}(W)^{1/2} \cdot \mathbb{B}(1)$ instead of just $\mathbb{B}(1)$ (we explain why below). Since the regret bound scales linearly with the product of the Lipschitz constant and the radius of the domain (as mentioned in my initial comment), the $\lambda_{\max}(W)^{1/2}$ factor would reappear in the bound, because the radius of the reparametrized set is now $\lambda_{\max}(W)^{1/2}$.

The reason the domain of $\bar{f}_t$ must be $\lambda_{\max}(W)^{1/2} \cdot \mathbb{B}(1)$ is that we want the reparametrization to preserve a one-to-one correspondence between the original set and the reparametrized set. To determine the mapped set under your proposed reparametrization, we apply the transformation $(W/\lambda_{\max}(W))^{-1/2}$ to the original ellipsoid $\mathcal{E}$, which yields a Euclidean ball of radius $\lambda_{\max}(W)^{1/2}$.

Comment

We thank the authors for the clarification. However, recall that the initial action set is given by $\mathcal{E} = \{x \in \mathbb{R}^d \mid x^\top W^{-1} x \leq 1\}$; we already started with an outer radius of $R = \sqrt{\lambda_{\max}(W)}$. So even after the transformation, although the action set's radius could be greater than 1, its upper bound $R$ remains the same (it is not increased), and the same analysis for $\kappa = 1$ should be applicable, since both $G$ and $R$ do not have to be changed (this is what I meant by "This would not violate the existing constraints on both $G$ and $R$").

Comment

That is a fair point. Given the closing deadline for the author response, we'd like to end on the following point: while computing the appropriate reparametrization for the ellipsoid was cheap, doing it for general sets can cost $O(d^4)$ calls to a separation oracle, making it prohibitive even if done only once. (We mention this in the paper on line 194.)

Final Decision

Overall, the opinion of the reviewers about this paper is positive. We had long discussions with the authors, and I went over the paper myself. It is agreed that the paper makes an interesting contribution to the field of projection-free OCO, which has received significant interest in recent years. I urge the authors to fulfill their promise that, if the paper is accepted, the discussions regarding previous projection methods and the limitations of this work will be better clarified. When comparing to previous work, it is highly important to also include the runtime (and even the memory requirement) and the dependence of the regret bound on the dimension, e.g., when comparing with Hazan & Kale 2012 and other works. It may also be interesting to discuss the comparison to projection-based methods in terms of the overall time to reach epsilon-average regret. This is important because this method (which improves upon a recent similar result), while still so-called "projection-free", is becoming more and more computationally demanding in terms of runtime (beyond oracle implementation) and memory, and has a regret bound that scales with the dimension.