An Ellipsoid Algorithm for Online Convex Optimization
A new projection-free algorithm for online convex optimization that relies on a separation oracle and does not require preconditioning of the feasible set.
Abstract
Reviews and Discussion
The paper considers the problem of developing efficient projection-free algorithms for online convex optimization (OCO). More formally, at each time $t$, the learner (an online algorithm) chooses $x_t \in \mathcal{K}$ for some convex set $\mathcal{K} \subseteq \mathbb{R}^d$, the adversary chooses a convex loss function $f_t$ and reveals a subgradient $g_t \in \partial f_t(x_t)$, and the learner incurs loss $f_t(x_t)$. The goal of the learner is to minimize the regret $\sum_{t=1}^T f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^T f_t(x)$. The optimal $O(\sqrt{T})$ regret for the problem is achieved via the projected-OGD algorithm. However, OGD requires projections onto $\mathcal{K}$ (whenever the iterate falls outside $\mathcal{K}$), which can be prohibitively expensive. Projection-free methods aim to alleviate this issue by assuming access to linear-optimization or separation-based oracles. The current paper expands the line of work on separation-based oracles for OCO and proposes an algorithm whose regret depends only logarithmically on $\kappa$ ($\mathcal{K}$'s asphericity parameter, defined as the ratio of the radius of the smallest circumscribed ball to that of the largest inscribed ball), thereby improving on the polynomial dependence on $\kappa$ in the regret bounds of prior works. The key novelty in the paper is the use of the ellipsoid method and a reduction to online convex optimization over a sequence of dynamically changing ellipsoids that always contain $\mathcal{K}$ and progressively approximate it well.
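To make the projected-OGD baseline concrete, here is a minimal sketch in the easy case where the feasible set is a Euclidean ball, so the projection is closed-form; for a general $\mathcal{K}$ the projection step is exactly what projection-free methods try to avoid. The function names and the $1/\sqrt{t}$ step size are a standard illustration, not taken from the paper.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x|| <= radius} (closed form)."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def projected_ogd(subgradient, T, d, radius=1.0):
    """Projected online gradient descent with step size 1/sqrt(t).

    subgradient(t, x) returns the adversary's subgradient g_t at x_t.
    For a general convex set, project_ball would be replaced by an
    expensive projection onto K -- the step projection-free methods avoid.
    """
    x = np.zeros(d)
    iterates = []
    for t in range(1, T + 1):
        iterates.append(x)
        g = subgradient(t, x)
        x = project_ball(x - g / np.sqrt(t), radius)
    return iterates

# Fixed linear losses f_t(x) = <g, x>; the best comparator over the
# unit ball is x* = -g / ||g||.
g = np.array([1.0, 0.0])
xs = projected_ogd(lambda t, x: g, T=200, d=2)
```

On this instance the iterate reaches the boundary point $-g$ after a couple of steps and the projection keeps it there, so the average regret vanishes quickly.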
Strengths and Weaknesses
Strengths: The paper addresses a fundamental limitation of existing projection-free, separation-oracle-based methods: they suffer from a polynomial dependence on the asphericity $\kappa$, which can be arbitrarily large for ill-conditioned sets. The paper manages to completely resolve this polynomial dependence (only a logarithmic dependence on $\kappa$ remains) via a novel method. Remarkably, the proposed algorithm requires the same order of calls to the separation oracle as the existing methods. The paper explains very clearly the limitations of the existing methods that achieved the prior bounds. The proposed algorithm is built upon these limitations, and Sections 3 and 4 are very well written. The algorithm, although technical, can be understood via the main ideas presented in Sections 3 and 4.1.
Weaknesses: The authors acknowledge in the limitations that the paper does not include experiments to support the theoretical claims. However, I do not consider this a major weakness: the proposed algorithm has a solid theory, and the lack of experimental evidence is a very minor issue.
Overall, my assessment of the paper is towards acceptance.
Questions
(1) In Definition 2.1, I believe the condition in Item II should hold for all points of the set, shouldn't it?
(2) For the most relevant prior work [17], what is the computational cost per round? Do they also incur a comparable per-round cost? My guess is not, but it would be great if the authors could clarify.
Limitations
Yes
Final Justification
I did not have any concerns and continue to vote for acceptance.
Formatting Concerns
No Concerns.
We are grateful for your time and for your thoughtful evaluation of our work.
In Definition 2.1, I believe the condition in Item II should hold for all points of the set, shouldn't it?
We will fix this typo.
For the most relevant prior work [17], what is the computational cost per round? Do they also incur a comparable per-round cost? My guess is not, but it would be great if the authors could clarify.
The per-round cost of [17] can be slightly lower than our cost of order $c_{\mathrm{sep}} + d^\omega$, where $c_{\mathrm{sep}}$ is the cost of performing separation and $d^\omega$ is our upper bound on the cost of solving a linear system. However, since the cost of performing separation typically dominates the cost of solving a linear system, the overall complexity of our method can be considered comparable to that of [17], or slightly higher.
I thank the authors for their response. I maintain my positive evaluation of the work.
This paper proposes a projection-free method for OCO that accesses the feasible set only through calls to a separation oracle. The method has the advantage that the regret depends on the asphericity only logarithmically.
Strengths and Weaknesses
Strengths:
- The approach has the clear advantage that the dependence of the regret on the asphericity is only logarithmic. This has not been achieved by existing methods.
- The paper clearly states how its analysis methods differ from the related work [17]. In particular, the use of ellipsoids rather than balls for the projection is a clear innovation.
- In general, the paper is well written.
Weaknesses:
- The paper builds closely on [17], with fairly minor improvements in guarantees.
- The computational cost per iteration depends on the term $d^\omega$. It appears that this could result in significantly higher computational cost than existing methods. As such, the regret-per-computation of this method might not be better than that of existing methods even if the regret is lower for a given number of rounds. Furthermore, it is unclear whether the computational cost is less than that of even projection-based methods (e.g., OGD, RFTL).
- Table 1 (which compares the guarantees with existing work) is somewhat confusing in that it shows the number of oracle calls per round but does not compare the computational cost. This is important given that the motivation for projection-free OCO is to reduce the computational cost of classical methods.
Questions
- What is the exponent $\omega$ in the computational cost term $d^\omega$?
- How does the computational cost of the method compare with the following: (a) Euclidean projection in every round, (b) putting the set in isotropic position once, (c) existing projection-free methods?
Limitations
yes
Final Justification
I appreciate the authors' efforts in the extensive discussion, which answered the clarifying questions that I had.
I am maintaining my current score. I believe that this score is appropriate because the paper has (what I see to be) a moderate methodological contribution in using an ellipsoid instead of a ball in the algorithm of [17] and developing an associated analysis. At the same time, it is less clear how significant the performance guarantees (e.g., regret, computation cost) are, as it seems that existing methods can provide better guarantees in many regimes.
Formatting Concerns
none
We appreciate your careful review and the time you spent on our submission.
The paper builds closely on [17], with fairly minor improvements in guarantees.
The regret bound in [17] scales linearly with the asphericity, which can be arbitrarily large. In contrast, our bound moves this dependence inside a logarithmic factor, which we view as a significant improvement over the guarantee in [17].
The computational cost per iteration depends on the term $d^\omega$. It appears that this could result in significantly higher computational cost than existing methods. As such, the regret-per-computation of this method might not be better than that of existing methods even if the regret is lower for a given number of rounds. Furthermore, it is unclear whether the computational cost is less than that of even projection-based methods (e.g., OGD, RFTL).
The $d^\omega$ term in our computational cost comes from inverting a matrix (or solving a linear system), which is required by the PoE subroutine for projecting onto the ellipsoid. Here, $\omega$ is the matrix multiplication exponent, currently estimated to be less than $2.38$. The per-round cost of our algorithm is on the order of $c_{\mathrm{sep}} + d^\omega$, where $c_{\mathrm{sep}}$ is the cost of performing separation. This can be slightly higher than the per-round cost of the algorithm in [17]. However, in typical optimization settings, the separation cost dominates $d^\omega$ (the cost of solving a linear system), in which case the overall computational cost of our method is comparable to that of [17].
As for the comparison to Euclidean projection: if we could reduce the cost of Euclidean projection onto an arbitrary set to that of solving a linear system, it would represent a major breakthrough. In general, the cost of Euclidean projection can be much higher than the cost $d^\omega$ of solving a linear system. In fact, given access to a separation oracle for $\mathcal{K}$, the cost of Euclidean projection onto $\mathcal{K}$ using state-of-the-art cutting-plane methods is substantially higher than $c_{\mathrm{sep}} + d^\omega$. Moreover, the complexity of cutting-plane methods often hides large multiplicative constants in the $\widetilde{O}(\cdot)$ notation.
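The paper's PoE subroutine is not reproduced here, but as an illustration of why projection onto an ellipsoid costs only linear algebra (one factorization plus cheap one-dimensional root-finding), here is a standard sketch; the function name and the bisection tolerance are ours, not the paper's.

```python
import numpy as np

def project_to_ellipsoid(y, A, c, tol=1e-10):
    """Euclidean projection of y onto E = {x : (x-c)^T A (x-c) <= 1}.

    Solves the KKT system x = c + (I + lam*A)^{-1} (y - c) by bisection
    on the multiplier lam >= 0. One eigendecomposition of A is the
    dominant cost; each bisection step afterwards is O(d).
    """
    d = y - c
    if d @ A @ d <= 1.0:          # already inside: projection is y itself
        return y.copy()
    w, Q = np.linalg.eigh(A)      # A = Q diag(w) Q^T, with w > 0
    z = Q.T @ d
    # phi(lam) = sum_i w_i z_i^2 / (1 + lam w_i)^2 - 1, strictly decreasing
    def phi(lam):
        return np.sum(w * z**2 / (1.0 + lam * w) ** 2) - 1.0
    lo, hi = 0.0, 1.0
    while phi(hi) > 0:            # bracket the root (phi -> -1 as lam grows)
        hi *= 2.0
    while hi - lo > tol * (1.0 + hi):
        mid = 0.5 * (lo + hi)
        if phi(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return c + Q @ (z / (1.0 + lam * w))
```

For $A = I$ (the unit ball) this reduces to radial scaling, e.g. projecting $(3, 4)$ yields $(0.6, 0.8)$; for a general $A$ the returned point lies on the boundary $(x-c)^\top A (x-c) = 1$.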
Table 1 (which compares the guarantees with existing work) is somewhat confusing in that it shows the number of oracle calls per round but does not compare the computational cost. This is important given that the motivation for projection-free OCO is to reduce the computational cost of classical methods.
As mentioned earlier, the separation cost typically dominates the $d^\omega$ term, which is why the table focuses on the separation-oracle complexity. We will add the other computational costs to the table for completeness.
What is the exponent $\omega$ in the computational cost term $d^\omega$?
$\omega$ is the matrix multiplication exponent and is currently estimated to be less than $2.38$. In our case, we use $d^\omega$ as an upper bound on the cost of solving a linear system.
How does the computational cost of the method compare with the following: (a) Euclidean projection in every round, (b) putting the set in isotropic position once, (c) existing projection-free methods?
Cost of Euclidean projection: achieved using state-of-the-art cutting-plane methods (see the Lee et al., 2018 reference below). We note that the $\widetilde{O}(\cdot)$ notation can hide large constants, which often makes such methods practical only for small- to medium-sized problems. Alternatively, one could use an interior-point method for Euclidean projection, but this requires a self-concordant barrier for the feasible set, which is not always available. Even when such a barrier exists, the typical cost remains greater than $d^\omega$.
Putting the set in isotropic position: on the order of $d^4$ calls to a separation oracle (see [14]). This can be prohibitive, even if performed only once. Moreover, placing the set in isotropic position does not necessarily eliminate the dependence on asphericity in the regret bound; see the paragraph titled “Limitations of reparametrization” on Line 194 and the response to Reviewer h8Ps.
Per-round cost of existing projection-free methods: the dependence on the asphericity is linear in the regret bounds of both [16] and [17]. Since the cost of separation typically dominates the cost $d^\omega$ of solving a linear system, the overall complexity of our method can be considered comparable to that of [17], or slightly higher.
Per-round cost of our approach: on the order of $c_{\mathrm{sep}} + d^\omega$; see the answer to the second comment above for a comparison with [17].
Lee, Y. T., Sidford, A., & Vempala, S. S. (2018, July). Efficient convex optimization with membership oracles. In Conference On Learning Theory (pp. 1292-1294). PMLR.
Thank you for this detailed explanation.
You mentioned that, in practice, $c_{\mathrm{sep}}$ dominates in the term $c_{\mathrm{sep}} + d^\omega$, and therefore that the computational cost is comparable to [17]. Is there a reason why this would be the case, or typical examples you can point me to?
Perhaps the most relevant setting---and in some sense the best-case scenario---for which the cost of separation is relatively small is when $\mathcal{K}$ is a polytope defined by linear constraints: $\mathcal{K} := \{x \in \mathbb{R}^d \mid a_i^\top x \leq b_i,\ \forall i \in [m]\}$, where $a_i \in \mathbb{R}^d$ and $b_i \in \mathbb{R}$. This is one of the most common setups in convex optimisation.
Separation cost. In this case, separation reduces to membership testing, which requires computing the inner products $a_i^\top x$ for $i \in [m]$. This gives $c_{\mathrm{sep}} = O(md)$. As soon as $m \gtrsim d^{\omega - 1}$, the cost of separation dominates the cost of solving a linear system, which is $d^\omega$.
Projection cost. As stated in the rebuttal, projection using state-of-the-art cutting-plane methods is expensive. However, in the polytope setting, it is more common to use an interior-point method, since a self-concordant barrier can be constructed easily: the logarithmic barrier $\Phi(x) := -\sum_{i=1}^m \log(b_i - a_i^\top x)$, which is $m$-self-concordant. Using this barrier, the projection cost scales with $\sqrt{m}\log(1/\epsilon)$ Newton iterations.
Side note. The polytope setting is also one where our approach is more efficient than projection-free methods based on Frank--Wolfe, which require solving a linear optimisation problem at each round. In this case, the cost of linear optimisation is comparable to that of projection using an interior-point method.
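For concreteness, here is what the $O(md)$ separation oracle for the polytope above looks like; the function name is ours, and for simplicity it returns the normal of a most-violated constraint rather than any deeper cut.

```python
import numpy as np

def polytope_separation_oracle(x, A, b):
    """Separation oracle for K = {x : A x <= b}, with A of shape (m, d).

    Returns (True, None) if x is in K; otherwise (False, s), where s is a
    unit vector satisfying <s, x - y> > 0 for every y in K (the normal of
    a violated constraint). The dominant cost is the m*d products in A @ x.
    """
    slack = b - A @ x
    i = np.argmin(slack)
    if slack[i] >= 0:             # all constraints satisfied: x is in K
        return True, None
    s = A[i] / np.linalg.norm(A[i])
    return False, s
```

The returned direction is valid because, for a violated constraint $i$ and any $y \in \mathcal{K}$, $a_i^\top y \le b_i < a_i^\top x$, hence $\langle a_i, x - y\rangle > 0$.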
Improvements after rebuttal. We will incorporate the additional costs into Table 1 in the final version. We will also include the clarifications raised in this rebuttal regarding the cost of the different oracles.
I thank the authors for the response above, but it seems to me that it is not completely revealing.
One measure that I think is interesting when considering the efficiency of online algorithms, and that makes it possible to compare them to projection-based algorithms, is the overall runtime to reach $\epsilon$-average regret.
If I am not mistaken, per the discussion on polytopes, the overall runtime of your method to reach $\epsilon$-average regret can be compared against a trivial benchmark: online gradient descent with an interior-point method to solve the projections. Comparing these bounds, if I am not mistaken, shows that there is no regime of $\epsilon$ for which your method is faster.
This does not at all mean that the result is not interesting to me, since the runtime per iteration can matter, but I think a more transparent discussion regarding the limitations of this approach, and fairer comparisons to other projection-free methods, are in order.
Also, the comment that linear-optimization-oracle-based methods are useless here is misleading. No one will use these unless the polytope has a very specific structure (e.g., a well-studied combinatorial structure), and this is well established in the literature on so-called projection-free methods.
The overall runtime to reach an $\epsilon$-average regret is indeed a good metric. As I will explain below, the cost of gradient descent combined with an interior-point method for projections can still exceed that of our proposed method when the number of constraints $m$ is sufficiently large relative to $d$.
Full cost of projection. The cost of Euclidean projection using an interior-point method (IPM) is actually on the order of $\sqrt{m}\log(1/\epsilon)\cdot(md^2 + d^\omega)$; here, $\sqrt{m}\log(1/\epsilon)$ is the number of iterations of the IPM, and $md^2 + d^\omega$ is the per-iteration cost. The $md^2$ term, which I omitted in my previous comment, arises from computing the Hessian of the barrier, which is a sum of $m$ rank-one $d \times d$ terms. (The $d^\omega$ term is for inverting the Hessian.)
Cost of GD + IPM. With this corrected projection cost, the total runtime of gradient descent with an interior-point method is the number of rounds times $\sqrt{m}\log(1/\epsilon)\cdot(md^2 + d^\omega)$.
Comparison with the proposed approach. This cost exceeds that of our method, whose per-round cost is on the order of $c_{\mathrm{sep}} + d^\omega = O(md + d^\omega)$, as soon as the number of constraints $m$ is large enough.
On projection-free approaches based on linear optimization. Projection-free optimization approaches based on linear optimization are indeed very useful and practical; their dimension-free regret guarantees are particularly appealing. My earlier comment was not meant to suggest they are useless. We were simply pointing out a popular setting where, in the absence of additional structure, our approach can be more competitive. Of course, there are also settings where Frank–Wolfe outperforms ours. We view these approaches as complementary.
A more transparent comparison. As mentioned in the previous comment, we will include these discussions in the paper for a more transparent comparison. This should be relatively straightforward and would not require any major changes.
Thank you for the clarification. Still, OGD + solving the projection with the SOTA cutting-plane method you mentioned above dominates the running time to reach $\epsilon$-average regret. Is this correct?
This leads me to my point: to get your complexity, couldn't I in principle partition the rounds into blocks of length $d$ and simply simulate the above cutting-plane method within each block?
Thank you for the follow-up questions:
OGD + SOTA cutting-plane method for projection. Indeed, in this case, the overall cost can be slightly better than that of our approach. However, we note the following points:
- The more standard Vaidya's cutting-plane method has a higher complexity (see [1] below). Achieving the improved dependence requires a much more involved algorithm and does not necessarily yield a practical method (see [2] below). (Cutting-plane methods are typically more theoretical in nature and often do not scale well in practice.)
- The $\widetilde{O}(\cdot)$ in the complexity of cutting-plane methods typically hides large constants, which could partly explain why they often do not scale as well as one might expect.
Side note. Since the submission, we have found a way to improve the $d^\omega$ term in the analysis, which would bring the runtime for achieving $\epsilon$-average regret in line with that of OGD + state-of-the-art cutting-plane methods in the polytope setting. That said, we believe the results in the paper are already meaningful and valuable even with the current cost.
Block-wise approach. Suppose we adopt a block-wise approach in which the cost of projection via cutting-plane methods is amortized over a block of $B$ rounds (with $B$ to be determined below).
- Since we now operate over blocks of size $B$, the per-round losses are effectively scaled by $B$, which appears in the regret. The resulting regret bound takes the form $\widetilde{O}(\sqrt{BT})$ (up to problem-dependent constants).
- The amortized cost per iteration is the cutting-plane projection cost divided by $B$. Matching the per-round cost of our approach fixes the required value of $B$, and this leads to an overall regret bound that is in line with our guarantee. So the question ultimately becomes whether the cutting-plane approach is a practical one (see the bullet points above).
References.
[1] Bubeck, Sébastien. "Convex optimization: Algorithms and complexity." Foundations and Trends® in Machine Learning 8.3-4 (2015): 231-357.
[2] Lee, Yin Tat, Aaron Sidford, and Sam Chiu-wai Wong. "A faster cutting plane method and its implications for combinatorial and convex optimization." 2015 IEEE 56th Annual Symposium on Foundations of Computer Science. IEEE, 2015.
Dear reviewer/AC,
Did we address all your questions? As the discussion period deadline approaches, please let us know if you have any additional questions or concerns we can address.
The paper proposes a novel projection-free algorithm for Online Convex Optimization (OCO) that relies solely on a separation oracle. Prior projection-free methods based on separation oracles suffer from poor dependence on the asphericity of the feasible set. This work introduces an ellipsoid-based reduction combined with an ONS-type (Online Newton Step) inner algorithm, achieving a regret with only logarithmic dependence on the asphericity, while preserving the per-iteration oracle complexity of prior methods. It significantly improves over previous methods in settings with ill-conditioned feasible sets.
Strengths and Weaknesses
Strengths
- Improved Regret Bound: Achieves better dependence on the asphericity, improving practicality for ill-conditioned problems.
- Methodical Analysis: The paper provides detailed and rigorous proofs, including algorithmic design, update rules, and complexity.
- Conceptual Clarity: The use of gauge projections and separation oracles is clearly explained and builds logically on prior work.
Weaknesses
- Computational Overhead of Ellipsoid Updates: These updates involve operations with per-round complexity of $d^\omega$, where $\omega$ is the matrix multiplication exponent. Any estimate for $\omega$? Would it influence the overall computational complexity?
- Lack of Empirical Validation: The paper focuses entirely on theoretical contributions and does not include any empirical results. While the regret guarantees are strong, it is unclear how the algorithm performs in practice. Empirical validation, such as experiments on standard OCO benchmarks or simulations comparing runtime and regret with other projection-free methods, would significantly strengthen the paper.
- Minor issue: It seems that you forgot to mention the full name of ONS.
Questions
Please see above.
Limitations
Please see above.
Final Justification
I think this is a good paper and I would like to keep my score.
Formatting Concerns
N/A
Thank you for your comments and for reviewing our paper.
Computational Overhead of Ellipsoid Updates: These updates involve operations with per-round complexity of $d^\omega$, where $\omega$ is the matrix multiplication exponent. Any estimate for $\omega$? Would it influence the overall computational complexity?
The $d^\omega$ term in our computational cost arises from the cost of inverting a matrix (or solving a linear system). This operation is required by the PoE subroutine to project onto the ellipsoid. The exponent $\omega$ is currently estimated to be less than $2.38$. However, even if we take $\omega = 3$ (plain Gaussian elimination), the cost $d^3$ of solving a linear system remains much smaller than the cost of Euclidean projection for many feasible sets of interest.
Lack of Empirical Validation: The paper focuses entirely on theoretical contributions and does not include any empirical results. While the regret guarantees are strong, it is unclear how the algorithm performs in practice. Empirical validation—such as experiments on standard OCO benchmarks or simulations comparing runtime and regret with other projection-free methods—would significantly strengthen the paper.
We have acknowledged the lack of experiments in the limitations paragraph. Nevertheless, we believe the theoretical results we provide stand on their own.
Minor issue: It seems that you forgot to mention the full name of ONS.
We will address this (ONS stands for Online Newton Step). Thank you for pointing this out.
This paper proposes a novel separation-based algorithm for online convex optimization (OCO) that improves the regret's dependence on the shape of the action set. Separation-based algorithms were proposed in prior works to mitigate the computational cost of the projection steps of gradient-based methods; however, they suffer from regret bounds that scale polynomially with the asphericity of the action set. The algorithm proposed in this work achieves a regret bound that depends only polylogarithmically on the asphericity, albeit with a worse dimension dependence.
Strengths and Weaknesses
Strengths: OCO is a central problem in online learning, and improvements in its regret bounds under computational constraints are impactful. The algorithm proposed in this work achieves improved performance compared to the best known separation-based algorithms when the shape of the action set is highly skewed. The proposed algorithm is explained in detail, with clear update conditions and intuitions, and its performance is analyzed and adequately compared to prior works.
Weaknesses:
- While the regret achieved by the proposed algorithm is order-wise larger than that of the sample-optimal gradient-based algorithms, it can also be order-wise larger than that of separation-based algorithms from prior works in certain regimes, e.g., when the dimension $d$ is large, due to the additional $d$ factors in the stated upper bounds.
- The assumptions and prior results could be better clarified. For instance, there is a typo in the second bullet point of Definition 2.1 that makes the inequality always invalid and the meaning of the separation oracle totally unclear. It would also be helpful to clarify the dependence on $R$ and $G$ of the regrets achieved in prior works, to ensure the comparison is appropriate.
Questions
- As elaborated in the weaknesses section, we recommend that the authors revise the manuscript to correct the typos in the problem formulation and clarify the dependencies on all parameters when comparing regrets with related works.
- The authors present a natural extension to stochastic optimization on page 9; however, it is not clear what oracle model is considered there. We recommend that the authors clarify the oracle model, as well as provide a brief overview of the best known results in different settings (e.g., zeroth order, first order, or having access to both).
- The asphericity of the action set is often not a fundamental constraint, in the sense that a hyperellipsoid action set could have large asphericity, yet it is trivial to achieve the optimal $O(\sqrt{T})$ regret by operating in a linearly transformed space where the action set is a hyperball while the constraint parameters $G$ and $R$ remain the same. Could the authors provide an example where the dependency on asphericity cannot be resolved by extending [16][7] through this approach?
Limitations
yes
Final Justification
While the authors promised to revise the manuscript to correct the typos, the paper could benefit from examples where high asphericity imposes a fundamental lower limit on the required sample complexity. The reviewer believes that a simple ellipsoid action set wouldn't serve as such a hard instance, since it can be resolved with the linear-transformation-based approach. The authors wanted to defend this example as a hard instance, but it appears that there is a mismatch in their comparison between the outer radii of the two action sets.
Overall, the topic of the paper is interesting, but perhaps, similar to the point raised by the AC, it would be beneficial if this result were accompanied by some converse arguments (even informal ones) to demonstrate why the proposed rates cannot be easily improved. Therefore, my score is unchanged. However, the topic investigated in this work is interesting enough that I won't have a strong opinion against its acceptance.
Formatting Concerns
NA
Thank you for taking the time to review our work and provide your feedback.
While the regret achieved by the proposed algorithm is order-wise larger than that of the sample-optimal gradient-based algorithms, it can also be order-wise larger than that of separation-based algorithms from prior works in certain regimes, e.g., when the dimension $d$ is large, due to the additional $d$ factors in the stated upper bounds.
There are indeed regimes where our regret bound is worse than that of previous separation-based algorithms, such as when the asphericity is already small. For this reason, our approach can be seen as complementary to existing methods.
As a side note, it is possible to design an algorithm that achieves a regret matching the best of the bounds obtained by our method and those in [16] and [17]. One such approach would aggregate the outputs of the individual algorithms using exponential weights; see, for example, reference [1] below.
As elaborated in the weaknesses section, we recommend that the authors revise the manuscript to correct the typos in the problem formulation and clarify the dependencies on all parameters when comparing regrets with related works.
We will correct the typo in the definition of the separation oracle. All the regret bounds---both ours and those we cite---depend linearly on the product $RG$, which is why we omitted it. This is generally the case for OCO algorithms, as their outputs are typically equivariant to the input scales $R$ and $G$. The exception is algorithms that adapt to $R$ and $G$, which is beyond the scope of this paper.
The authors present a natural extension to stochastic optimization on page 9; however, it is not clear what oracle model is considered there. We recommend that the authors clarify the oracle model, as well as provide a brief overview of the best known results in different settings (e.g., zeroth order, first order, or having access to both).
We consider the standard stochastic optimization setting with access to a first-order stochastic gradient oracle. Some details specific to the stochastic setting were moved to Appendix F due to space constraints. We will restore more of this material to the main text thanks to the additional page allowed for the camera-ready version.
The asphericity of the action set is often not a fundamental constraint, in the sense that a hyperellipsoid action set could have large asphericity, yet it is trivial to achieve the optimal $O(\sqrt{T})$ regret by operating in a linearly transformed space where the action set is a hyperball while the constraint parameters $G$ and $R$ remain the same. Could the authors provide an example where the dependency on asphericity cannot be resolved by extending [16][7] through this approach?
This is not always true, and here is why. Let $A$ be a positive-definite matrix, and let $\mathcal{E} := \{x \in \mathbb{R}^d : x^\top A x \le 1\}$ be an ellipsoid. Assume that $\lambda_{\max}(A) = 1$, so that the unit ball $\mathbb{B}$ satisfies $\mathbb{B} \subseteq \mathcal{E}$. Suppose we are in the OCO setting, where $\mathcal{E}$ is the feasible set and the losses are $G$-Lipschitz over $\mathcal{E}$; that is, $\|\nabla f_t(x)\| \le G$ for all $x \in \mathcal{E}$.
Now, if we apply a reparametrization to work over the Euclidean ball instead, the reparametrized losses we work with are given by $\tilde f_t(u) := f_t(A^{-1/2}u)$. By the chain rule, the gradients of $\tilde f_t$ satisfy $\nabla \tilde f_t(u) = A^{-1/2}\nabla f_t(A^{-1/2}u)$ (assuming $f_t$ is differentiable for simplicity). Therefore, the Lipschitz constant of $\tilde f_t$ in the reparametrized problem can be as large as $\|A^{-1/2}\|\, G = \kappa G$, which can be arbitrarily larger than $G$ for ill-conditioned $A$. Here, $\kappa := \sqrt{\lambda_{\max}(A)/\lambda_{\min}(A)}$ is a measure of the asphericity of $\mathcal{E}$.
Since the regret bound scales linearly with the Lipschitz constant of the losses (see the answer to the first question), working with the reparametrized losses in this way can introduce a dependence on the asphericity in the regret bound. This issue with reparametrization holds more broadly (beyond the ellipsoid setting), as we discuss in the paragraph titled “Limitations of reparametrization” on Line 194.
[1] Dylan J Foster, Satyen Kale, Mehryar Mohri, and Karthik Sridharan. Parameter-free online learning via model selection. In Advances in Neural Information Processing Systems, pages 6020–6030, 2017.
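As a concrete numeric illustration of the reparametrization argument above, with an assumed diagonal $A$ normalized so that $\lambda_{\max}(A) = 1$ (the specific values are ours, chosen only to exhibit the blow-up):

```python
import numpy as np

# Ill-conditioned ellipsoid E = {x : x^T A x <= 1} with lambda_max(A) = 1.
lam_min = 1e-6                       # asphericity kappa = 1/sqrt(lam_min) = 1000
A = np.diag([1.0, lam_min])
A_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(A)))

# A 1-Lipschitz linear loss f(x) = <g, x> with ||g|| = 1.
g = np.array([0.0, 1.0])

# Reparametrized loss f~(u) = f(A^{-1/2} u) has gradient A^{-1/2} g.
grad_reparam = A_inv_sqrt @ g
kappa = 1.0 / np.sqrt(lam_min)
print(np.linalg.norm(g))             # 1.0  (original Lipschitz constant)
print(np.linalg.norm(grad_reparam))  # 1000.0, i.e. kappa * G
```

The gradient norm after reparametrization is exactly $\kappa G$, so the regret bound (linear in the Lipschitz constant) inherits the asphericity.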
We thank the authors for their response. Two quick follow-up questions:
- While the authors noted plans to correct the typos, it would be appreciated if the revised version of Definition 2.1 could be shared in the discussion as well. Although the intended meaning can be inferred, confirming the corrected form would be helpful, given its central role in the paper.
- Regarding the clarification of the hyperellipsoid example provided by the authors, we would like to confirm that this analysis does not directly induce a necessary dependence of the loss on asphericity. To see why this is the case, we can always transform the action set with a suitable linear map, which only increases the lengths of the shorter axes of the hyperellipsoid. This would not violate the existing constraints on either $G$ or $R$.
Thank you for your follow-up questions.
Here is the corrected definition of the separation oracle.
Definition 2.1. A separation oracle for a set $\mathcal{K} \subseteq \mathbb{R}^d$ is an oracle that, given $x \in \mathbb{R}^d$, returns a pair $(b, s) \in \{0, 1\} \times \mathbb{B}$ (where $\mathbb{B}$ is the unit Euclidean ball in $\mathbb{R}^d$), such that:
- $b = 1$ and $s = 0$ if $x \in \mathcal{K}$; otherwise,
- $b = 0$ and $\langle s, x - y \rangle > 0$ for all $y \in \mathcal{K}$.
The hyperellipsoid case.
If we understand correctly, you are suggesting working with a reparametrized function whose gradient norm stays bounded by $G$, instead of the reparametrization we originally proposed. In this case, the norm of the gradient would indeed be at most $G$, which is great. However, the domain of the reparametrized losses would now be a Euclidean ball of radius $\kappa$ instead of the unit ball (we explain why below). Since the regret bound scales linearly with the product of the Lipschitz constant and the radius of the domain (as mentioned in my initial comment), the factor $\kappa$ would reappear in the bound, because the radius of the reparametrized set is now $\kappa$.
The reason the domain must be the ball of radius $\kappa$ is that we want the reparametrization to preserve a one-to-one correspondence between the original set and the reparametrized set. To determine the mapped set under your proposed reparametrization, we apply the corresponding transformation to the original ellipsoid $\mathcal{E}$, which yields a Euclidean ball of radius $\kappa$.
We thank the authors for the clarification. However, recall that the initial action set already has an outer radius of $R$. So even after the transformation, although the action space has a radius that could be greater than 1, its upper bound remains the same $R$, which is not increased, and the same analysis should be applicable, since both $G$ and $R$ do not have to be changed (this is what I meant by "this would not violate the existing constraints for both $G$ and $R$").
That is a fair point. Given the closing deadline for the author response, we'd like to end on the following point: while computing the appropriate reparametrization for the ellipsoid was cheap, doing so for general sets can cost $O(d^4)$ calls to a separation oracle, making it prohibitive even if done once. (We mention this in the paper on Line 194.)
Overall, the opinion of the reviewers about this paper is positive. We had long discussions with the authors, and I went over the paper myself. It is agreed that the paper makes an interesting contribution to the field of projection-free OCO, which has received significant interest in recent years. I urge the authors to fulfill their promise that, if the paper is accepted, the discussions regarding previous projection methods and the limitations of this work will be better clarified: when comparing to previous work, it is highly important to also include the runtime (and even the memory requirement) and the dependence of the regret bound on the dimension, e.g., when comparing with Hazan & Kale 2012 and other works. It may also be interesting to discuss the comparison to projection-based methods in terms of the overall time to reach $\epsilon$-average regret. This is important, since this method (which improves upon a recent similar result), while still so-called "projection-free", is becoming more computationally demanding in terms of runtime (beyond oracle implementation) and memory, and has a regret bound that scales with the dimension.