Orthogonal Function Representations for Continuous Armed Bandits
In this paper we use orthogonal functions to design two efficient algorithms for the continuous armed bandit problem.
Abstract
Reviews and Discussion
This paper studies the continuum-armed bandit problem, which is an extension of the traditional multi-armed bandit. Specifically, the authors propose an explicit representation using an orthogonal feature map (e.g., based on Fourier or Legendre functions) to transform the original problem into a linear bandit with misspecification. The authors develop two algorithms, named OB-LinUCB and OB-PE, and use a suite of simulations to verify the efficiency of the proposed algorithms.
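For context, here is a minimal sketch (our illustration, not code from the paper) of how such an orthogonal feature map can turn an arm $x \in [0,1]$ into a feature vector for a (misspecified) linear bandit; the truncation level `N` and the normalizations are assumptions:

```python
import numpy as np
from numpy.polynomial import legendre

def fourier_features(x, N):
    """Map arms x in [0, 1] to the first 2N+1 orthonormal Fourier features."""
    x = np.asarray(x, dtype=float)
    feats = [np.ones_like(x)]
    for k in range(1, N + 1):
        feats.append(np.sqrt(2) * np.cos(2 * np.pi * k * x))
        feats.append(np.sqrt(2) * np.sin(2 * np.pi * k * x))
    return np.stack(feats, axis=-1)

def legendre_features(x, N):
    """Map arms x in [0, 1] to the first N+1 Legendre polynomials, rescaled to be orthonormal."""
    t = 2 * np.asarray(x, dtype=float) - 1  # Legendre polynomials are orthogonal on [-1, 1]
    feats = [np.sqrt(2 * k + 1) * legendre.Legendre.basis(k)(t) for k in range(N + 1)]
    return np.stack(feats, axis=-1)

# A smooth reward f(x) is then approximately <theta, phi(x)>, i.e., a misspecified linear bandit.
```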
Strengths
Most parts of the paper are quite clear and easy to follow.
Based on my knowledge, it is new to use orthogonal function bases to transform the continuum-armed bandit into misspecified linear bandits. The simulations also showcase the high efficiency of the proposed algorithms.
I haven't checked all details of the proof in the Appendix, but I feel they should be correct. I will refer to other reviewers' opinions as well.
Weaknesses
- Nowadays, the study of continuum-armed bandits also covers general metric spaces (e.g., the Zooming algorithm mentioned in your work handles an arbitrary space with any distance). Could you extend your algorithm to general metric spaces? I feel it should be possible if we can construct a decent orthogonal basis on any metric space; is that true? However, how to construct such a function basis is unclear.
- I agree it is relatively hard to explore a multidimensional space, but I feel the current results in this work on multidimensional spaces are still weak. From Table 1, the proposed OB-PE can achieve a non-optimal regret bound under a certain condition, which is not up to the state-of-the-art literature in this area. For implementation, both the number of arms and the feature dimension would increase exponentially, and hence I am not sure whether it can still perform well in multidimensional spaces.
- Without knowing the value of $s$, I am not sure whether the proposed algorithm still works, since some settings (e.g., the choice of $N$) rely on the value of $s$.
- For the presentation of Algorithm 1, why don't the authors specify how to discretize the arm space directly there? It seems that discretization is unavoidable for OB-LinUCB and OB-PE.
Questions
Besides my concerns in the above Weaknesses section, could you also answer my questions about your experiments as follows:
- It seems that you only use OB-LinUCB, which lacks strong theoretical support, in your experiments. Why don't you also use OB-PE, since it has good theoretical properties? It seems there is some inconsistency between theory and experiments.
- I may have overlooked it, but how many arms do you choose in the experiments for OB-LinUCB? I think you discretize the interval [0,1] evenly (with some number of arms). Do you have any theoretical support for OB-LinUCB when discretizing the arm set? I think this is very important and should be illustrated clearly in your main paper.
- Some experiments on a higher-dimensional space (e.g., $[0,1]^2$) would make your experiments more reliable. I am concerned about the computational issues of the algorithm, since the number of arms and the dimension may explode exponentially.
Continuum-armed bandits on general metric spaces. Unfortunately, generalizing this approach to that setting would be highly nontrivial, as polynomial and trigonometric features are not even defined on arbitrary metric spaces. On the other hand, the main competitor of our algorithm, the UCB-Meta-algorithm, suffers from the same issue, also because the concept of "differentiability" itself cannot be trivially defined there.
Computational performance in multidimensional space. Please see the common answer about computational complexity.
Does the algorithm work without knowing the value of $s$? Please see the common answer about adaptivity.
For the presentation of Algorithm 1, why don't the authors specify how to discretize the arm space directly there? In fact, discretization is unavoidable for all the algorithms considered in the state of the art, the only exception being Zooming, which without discretization is NP-hard. Therefore, to grant a fair comparison, we preferred not to specify the discretization, so that Table 1 can show how the performance of every algorithm depends on the number of elements in its discretization. It is true that in the general case a uniform cover would be the best choice, but emphasizing the dependence on the cardinality of a general discretization is also important.
Use of OB-LinUCB rather than OB-PE in the experiments. We made this choice based on the principle that, usually, algorithms that optimize the worst-case regret are not the best option in practice. Consistently with this choice, we compared our algorithm not with BPE, which achieves the best regret in kernelized bandits, but with IGP-UCB, which is well known to have superior performance in practice despite its worse regret bound. Still, since the practical performance of OB-PE is also an interesting question, we are going to run the experiments with this algorithm and upload them in a new version of the paper before the end of the discussion period.
Discretization in the experiments. As the reviewer has understood, we discretize the interval [0,1] evenly. Unfortunately, OB-LinUCB is not endowed with a regret bound for this setting; it is only shown as the most natural algorithm that can be built with the orthogonal function representation.
Some experiments on a high-dimensional space (e.g., $[0,1]^2$) will make your experiments more reliable. I am concerned about the computational issue of the algorithm since the number of arms and dimension may explode exponentially. First point: since the algorithm is valid for general $d$, it is indeed a good idea to add an experiment in a bi-dimensional environment. We will do it and upload it in a new version of the paper before the end of the discussion period. Unfortunately, it will not be possible to also include the baselines used in the current experiments, as their running time is too high and we cannot have the results in time. Second point: see the common answer about computational complexity.
We thank the reviewer for their comments and hope to have addressed all their questions.
I appreciate the authors' responses to my questions and the additional experiments made for OB-PE. I believe this work has its merit, and some extensions, such as to general metric spaces, are highly non-trivial. The idea of introducing an orthogonal feature mapping is new and interesting. However, I still have concerns about the presentation of the work: I feel it would be better to specify explicitly, in the pseudocode, how to discretize the space. Also, the computational benefit of the algorithm is not very general. Therefore, I feel this paper is right on the borderline, and I will keep my score for now.
The paper studies continuous-armed bandits where the reward function is smooth. By introducing orthogonal Fourier and Legendre feature maps, the paper shows that an approximation of the smooth function is possible even when the feature map is not available to the learner. The proposed algorithm achieves a nearly optimal regret bound in the one-dimensional case, and in multidimensional cases when the reward function is analytic. Empirical results show fair performance of the algorithm with computational efficiency.
Strengths
The paper relaxes the assumptions on the reward function in bandit problems with continuous arms, without requiring Gaussianity, by constructing an orthogonal feature map. The proposed algorithm is principled and computationally efficient for estimating general reward functions. The helpful discussion on the hardness of deriving optimal regret bounds for multidimensional arms points to future work directions.
Weaknesses
(a) In Theorems 3-5, the choice of $N$ requires the knowledge of $s$. (b) There is a large gap between the choice of $N$ in the theoretical results and in the empirical results. The performance and the computational time of the proposed algorithms seem to be heavily affected by the choice of $N$. Numerical results and discussions on different choices of $N$ seem to be missing.
Questions
Q. Could the algorithm be modified to execute without knowing $s$ a priori? Q. How could we choose $N$ when we do not know the true reward function? If we choose $N$ as in Theorems 3-5, would the computation be heavy? Q. How does the choice of $N$ affect the performance and computational time of Fourier UCB and Legendre UCB? Q. In Figures 1 (b) and 2 (b), why do Fourier UCB and Legendre UCB perform worse when they use only even functions and the true reward function is even?
Adaptivity for unknown $s$ and how to choose $N$ without $s$. See the common answer about adaptivity.
Choice of $N$ in the experiments. Keeping the values of $N$ in the experiments disconnected from theory was a deliberate choice, to show that the requirement to know the differentiability of the reward function exists only in theory. In fact, having chosen a fixed value of $N$ for functions with very different levels of regularity suggests that this parameter can be chosen in a totally empirical way and tuned very easily.
Computational complexity dependence on $N$. This question is linked to the theme of computational complexity; therefore, we suggest that the reviewer also see the "Computational complexity" section in the answer to all reviewers. In short, the computational complexity scales polynomially with the number of features. This is unavoidable when using algorithms for linear bandits, as the inversion of a design matrix is necessary. Still, note that the algorithm chooses $N$ in such a way that the number of features remains controlled, so that the full computational complexity does not explode.
In Figures 1 (b) and 2 (b), why do Fourier UCB and Legendre UCB perform worse when they use only even functions and the true reward function is even? We thank the reviewer for noticing this subtle phenomenon. The reason lies in the true reward function: as it is a polynomial of degree 4, including even polynomials of higher degree in the feature map is as useless as including odd functions. In fact, as high-frequency features are more prone to overfitting the noise, the effect of including them may be even worse than adding other useless components, as in this example.
We thank the reviewer for their comments and insightful observations and hope to have addressed all their questions.
Thank you for the response. It addresses most of my questions. But still, I have some questions. Q. Could you explain why $N$ can be tuned very easily? Q. If the algorithm chooses $N$ in the way described, does the required condition still hold? Q. If $N$ is chosen empirically, does the theoretical regret guarantee still hold?
Indeed, the fact that $N$ can be tuned very easily is only an empirical observation, and when using an $N$ different from the one with theoretical guarantees, the only possible regret bounds are analogous to the ones obtained for a misspecification of $s$, discussed previously.
In fact, there is a phenomenon similar to what happens with Fourier series: despite the theory telling us that Fourier series converge only under very specific hypotheses, their approximation power in practice is universally recognized. The reason is that the counterexamples of functions for which Fourier series converge slowly, or do not converge at all, are often "pathological" functions with characteristics that are never observed in reality, given that the functions used in real problems are usually piecewise regular. This phenomenon also applies to Legendre polynomials, as we observe in the experiments: since
- what really matters in continuous armed bandit algorithms is to estimate the function near the maximum
- the convergence of Fourier series/Legendre polynomials for piecewise regular functions is very rapid (unless one is near a point of discontinuity)
excellent results can be obtained even for values of $N$ much smaller than those for which there are theoretical guarantees. This means that by taking a reasonably small value of $N$, our algorithm exhibits excellent regret with exceptional computational complexity.
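As a toy illustration of this point (our own sketch, not an experiment from the paper; the test function, the truncation level, and the FFT-based computation are arbitrary choices), a truncated Fourier series of a smooth periodic function is already very accurate with a handful of coefficients:

```python
import numpy as np

def fourier_partial_sum(f, N, n_grid=4096):
    """Truncated Fourier series of a periodic f on [0, 1), computed via the FFT."""
    x = np.arange(n_grid) / n_grid
    coeffs = np.fft.rfft(f(x))      # discrete Fourier coefficients
    coeffs[N + 1:] = 0.0            # keep only frequencies 0, 1, ..., N
    return x, np.fft.irfft(coeffs, n_grid)

f = lambda x: np.exp(np.sin(2 * np.pi * x))   # smooth and periodic: coefficients decay very fast
x, f_N = fourier_partial_sum(f, N=6)
print("max error with N = 6:", np.max(np.abs(f(x) - f_N)))   # already tiny despite the small truncation
```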
This paper studies continuum-armed bandits with a generic smoothness assumption, namely that the reward is $s$ times continuously differentiable. The main new idea of the paper is that, by using a finite representation of the reward function in terms of a known orthogonal basis for the function class, one can reduce the problem to misspecified linear bandits. This leads to an algorithm with optimal regret in some regimes and better computation time than prior algorithms in the literature.
While I think the idea of this paper is novel, the theoretical regret bounds are only optimal in some regimes, and the overall benefit, especially over the cited UCB-Meta-algorithm, is unclear.
Strengths
- The paper is easy to read and well-written despite being quite math heavy, and I found the main algorithmic idea straightforward to follow.
- The thorough comparison with prior works in Section 5.3 was helpful for placing this result in context with the rest of the literature.
Weaknesses
- It seems like the theoretical regret upper bounds require tuning with knowledge of the underlying reward function's smoothness $s$. This seems like an unrealistic assumption to me. Can the authors comment on a misspecified $s$ or on adapting to unknown $s$?
- In order to even use the orthogonal features (at least in dimension 1), it seems like one also needs the $s$-th derivative of the reward to be square integrable. Is this a reasonable assumption to make for common reward functions in stochastic optimization? Some discussion on this would be nice.
- Looking at Table 1, it seems like the UCB-Meta-algorithm attains the optimal regret, while the main procedure of this paper, OB-PE, does not get optimal regret in the multidimensional case. The paper claims the UCB-Meta-algorithm has no regret guarantee for infinitely differentiable rewards, but I don't see why I can't just run the UCB-Meta-algorithm with a very large Hölder exponent, which should get a regret bound approaching $\sqrt{T}$ according to the bounds of Liu et al., 2021. Thus, the only real advantage of OB-PE seems to be in time complexity.
- There is a discrepancy between the experiments and the theory, in the sense that the paper uses two different algorithms, OB-PE vs. OB-LinUCB, for the theoretical regret bounds vs. the experiments. It is not explained why OB-PE is not implemented or analyzed in the experiments. Some explanation of this would be nice.
Questions
Questions
- See above in "Weaknesses".
- What is the dependence of the regret bounds of OB-PE on the Lipschitz constant $L$? In the Lipschitz bandits literature this dependence is well known, but it is not clear to me what dependence appears here.
- Theorem 4, for the multidimensional regret upper bound, does not seem to have any condition on the $s$-th partial derivatives of the reward, which seems wrong to me since Theorem 3 required it. Why is this?
Writing Suggestions/Typos
- The regret formula in the first paragraph should have the terms reversed in the difference.
- In the fifth paragraph of page 1, should the domain of the reward function be $[0,1]^d$?
- I was confused about why the reward function initially has unbounded scale, yet all the regret bounds seem to be scale-free. This is because the paper later on assumes a bound on the norm of the reward, i.e., assumes knowledge of its scale. It might be better to just define a bounded reward from the outset so as not to mislead readers.
- Many environment references throughout the writing (e.g., algorithm 1, appendix 1, equation (1)) should have their names capitalized (e.g., Algorithm 1).
- In Section 4.2, the Lipschitz constant is used before it is defined.
- In the writing, "s+1-th derivative" would read better as "(s+1)-th derivative".
Adaptivity on $s$. See the common answer about adaptivity.
Assuming a square-integrable derivative in practice. The setting of stochastic bandits has achieved relevant success in the field of pricing. In this area, bandit algorithms are used to optimize a function of the price $p$ of the form $p\,d(p)$, where $d$ is an unknown function known as the "demand curve", which is non-increasing in $p$ (ref. "Pricing the long tail by explainable product aggregation and monotonic bandits" by Mussi et al.). Of course, the differentiability of the reward corresponds to that of $d$, the demand curve, and since this function is decreasing and bounded, its derivative cannot be "too large too often". This can be formulated by asking that its derivative be square-integrable.
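A toy instance of this setting (our own illustration; the logistic demand curve below is a made-up example, not taken from Mussi et al.):

```python
import numpy as np

def demand(p):
    """A bounded, non-increasing demand curve on [0, 1] (illustrative logistic choice)."""
    return 1.0 / (1.0 + np.exp(8.0 * (p - 0.5)))

def expected_reward(p):
    """Expected revenue p * d(p): smooth, with derivatives that are easily square-integrable."""
    return p * demand(p)

# The bandit repeatedly posts a price p in [0, 1] and observes a noisy realization of p * d(p).
```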
Infinitely differentiable functions. We agree with the reviewer that, since infinitely differentiable functions are included in the class of $s$-times differentiable functions for every $s$, the algorithm from Liu et al., 2021 can achieve regret arbitrarily close to $\sqrt{T}$ when the function is infinitely differentiable. The point is that the UCB-Meta-algorithm has to choose a number of features that depends on $s$ and diverges as $s$ grows (that algorithm relies on dividing the action space into hypercubes and running a linear bandit algorithm in each hypercube; we refer to the number of features used by each of these linear bandits). Conversely, our algorithm uses fewer features the higher the value of $s$, and in particular, for infinitely differentiable functions, it can achieve regret of the same order using the lowest number of features. This phenomenon makes the two algorithms complementary, with the UCB-Meta-algorithm being more suitable for small $s$, and ours for large $s$ or infinitely differentiable rewards.
Use of OB-LinUCB in the experiments. We made this choice based on the principle that, usually, algorithms that optimize the worst-case regret are not the best option in practice, where simplicity works much better. Consistently with this choice, we compared our algorithm not with BPE, which achieves the best regret in kernelized bandits, but with IGP-UCB, which is well known to have superior performance in practice despite its worse regret bound. In the end, if one of our algorithms has a hope of being used in practical scenarios, it is probably OB-LinUCB, which is the most natural marriage between a linear bandit algorithm and the orthogonal function representations. Still, we will implement OB-PE and let the reviewer have its results by the deadline of the discussion period.
Dependence of the regret bounds of OB-PE on the Lipschitz constant $L$. Thanks for the question, which opens up some reflections we hadn't considered. Our regret bound is roughly of order $\sqrt{N^d T}+L\,N^{d/2}N^{-s}T$, where $L$ is the Lipschitz constant of the $s$-th derivative. With the optimal choice of $N$ we get, in the regime where we have regret guarantees, a dependence of order $L^{\frac{d}{2s}}$ (which of course becomes vacuous as $s \to 0$, when the regret bound itself is unbounded).
Assumptions of Theorem 4. The different type of assumption reflects the difference in the type of analysis behind the result. In fact, given that the theory of Legendre polynomials has so far been developed only in one dimension, it is not possible to use it in the general case, and the argument used there does not make assumptions on the $s$-th derivative. This is also linked to the reason why our analysis is sub-optimal in the multidimensional case.
We thank the reviewer for their comments and for identifying some typos in our paper. We hope we have addressed all of their concerns.
Thank you for the clarifying response. While I originally believed the main benefit of the proposed algorithm was its reduced computational complexity, after reading the other discussions and the general rebuttal, it seems like even that comparison is only favorable in certain regimes of the problem parameters. Overall, I think the "reduction" of infinite-armed bandits to linear bandits is interesting, but the overall benefit still seems to be lacking. As such, I will keep my score the same.
Consider a continuous-armed bandit problem where the action space is $[0,1]^d$. While Reproducing Kernel Hilbert Space (RKHS) feature maps have been used previously, they present computational difficulties. The authors propose an approach that uses an orthogonal feature map to convert the problem into a linear bandit problem. This approach not only offers competitive regret guarantees compared to existing methods, but also reduces computational complexity.
Strengths
An analysis of linear bandit algorithms under misspecification is utilized for continuous-armed bandits, yielding an algorithm with small computational complexity.
It demonstrates competitive performance numerically as well.
Weaknesses
A significant portion of this paper seems to be dedicated to presenting classical results. While understanding the foundation is essential, I would encourage the authors to consider focusing more on their unique contributions to the field.
From my understanding, the proposed algorithm/analysis appears to present a solution for an efficient trade-off between misspecification (bias) and regret (variance) in the continuous-armed bandit problem. However, the contribution seems to lie primarily in combining these elements, and I did not find the approach significantly novel. If there is technical novelty, it should be better presented and emphasized. Moreover, the paper does not clearly address the possibility of the $d/s$ value becoming significantly large (though not infinite). In such cases, the computational advantage of the proposed method compared to existing methods may not only shrink but could even become negative. This issue needs to be addressed.
The algorithm seems to require a substantial number of inputs (values depending on $T$ and $s$, the choice of basis, etc.). How much does this limit its adaptability? Are there methods that are agnostic to $T$ or $s$? Or is this level of input requirement comparable to existing research? These questions are crucial when discussing the adaptability of the algorithm.
Considering the high standards of ICLR, it is unclear whether the contributions of this paper are significant enough to warrant acceptance. The authors might want to better highlight the novelty and impact of their work.
Questions
As I also wrote in the previous section:
- What is the technical novelty of the proposed algorithm/analysis? How is it emphasized?
- Can you explain how the proposed method addresses the potential for the d/s value to become significantly large?
- To what extent does the substantial number of inputs required by the algorithm limit its adaptability?
- Are there methods that are agnostic to T or s?
- How does the level of input requirement in this paper compare to existing research?
Paper organization. The theory behind our algorithm is highly nontrivial for someone with a background different from mathematics. Therefore, we have decided to keep in the main paper the parts that are essential for comprehending the work, and to leave the more advanced and subtle discussions for the appendix.
Novelty. We acknowledge the reviewer's observation that our analysis relies heavily on theorems already established in the field of functional analysis. However, we disagree with the notion that this diminishes the novelty of our work. On the contrary, given that the existing literature predominantly employs kernel methods or focuses on finding a cover for the arm space, the utilization of a theory previously unexplored in this context should be regarded as a valuable contribution. Recent research on bandits in Reproducing Kernel Hilbert Spaces (RKHS) has revealed that performance guarantees are intricately tied to the decay properties of the eigenvalues of the kernel. In our study, we demonstrate (see also Appendix E.2) that the same holds true for general continuous-armed bandits. This suggests that, even if certain results in our paper fall short of optimality, the methods formulated here could be applied to extend many findings from the extensive RKHS theory to this specific setting.
Computational advantages when the $d/s$ value becomes significantly large. See the common answer about computational complexity.
Adaptivity for the values of $T$ and $s$. See the common answer about adaptivity.
Input requirements in this paper compared to existing research. All the related papers, like ours, assume knowledge of both the time horizon and the order of differentiability. In particular, the four works on Bayesian optimization also presume working in a Reproducing Kernel Hilbert Space (RKHS) with a known kernel. Since the knowledge of the kernel directly determines the order of smoothness of the functions in the RKHS, their requirement is notably more stringent than ours.
We thank the reviewer for their comments and hope to have addressed all their questions.
Thank you for your response, but I'm still not clear on why it's novel to combine the results of the misspecification bandit and the functional analysis. If it's simply a matter of combining the results, it seems far from the standards of ICLR. Were there any challenges that arose from combining the existing results? If so, what were the steps you took to overcome those challenges?
Since our article presents a new line of research that transcends the two strands to which all the remaining literature on continuous-armed bandits belongs, we are perplexed that the novelty is being contested. If the point of the reviewer is about the technical difficulty of the proofs, here is a non-exhaustive list of the challenges that we faced:
- Using standard linear bandit algorithms leads to a regret bound of a weaker form than the one we needed. This problem was solved by using a particular algorithm for linear bandits that was invented very recently, together with an argument that is explained in the main proof.
- For practical bandit algorithms, other polynomial bases, such as Bernstein polynomials, are often used for representation learning. Instead, our argument uses the orthogonality property of Legendre polynomials, which leads to a representation well suited to linear bandit algorithms through the use of Parseval's theorem (see the numerical sketch after this list).
- The proofs in the three cases are all different, since there are holes in the mathematical theory that do not allow the same proof strategy to be used in all of them. In particular, the last case, which is not explicitly addressed in the rest of the literature, has a proof that does not rely on pre-existing results.
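To make the role of orthogonality concrete, here is a small numerical check (our own illustration, not code from the paper; the test function, truncation levels, and quadrature are arbitrary choices) of the Parseval identity that converts the truncation error into a misspecification level:

```python
import numpy as np
from numpy.polynomial import legendre

# Orthonormal Legendre basis on [-1, 1]: e_k(t) = sqrt((2k+1)/2) * P_k(t).
t, w = legendre.leggauss(80)                # Gauss-Legendre nodes/weights: exact for the degrees used here
f = np.exp(t)                               # an arbitrary smooth test function
K, N = 25, 5                                # K: "full" expansion, N: truncation kept by the algorithm

e = [np.sqrt((2 * k + 1) / 2) * legendre.Legendre.basis(k)(t) for k in range(K + 1)]
c = np.array([np.sum(f * e_k * w) for e_k in e])    # coefficients <f, e_k>

f_N = sum(c[k] * e[k] for k in range(N + 1))        # truncated representation fed to the linear bandit
lhs = np.sum((f - f_N) ** 2 * w)                    # || f - f_N ||_{L^2}^2
rhs = np.sum(c[N + 1:] ** 2)                        # Parseval: energy of the dropped coefficients
print(lhs, rhs)                                     # the two quantities coincide up to numerical error
```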
Still, is this the real aim of a machine learning paper? The core of our work is to propose a new approach based on a kind of theory that was never used in the bandit setting. Incorporating new results from contemporary mathematics to create new algorithms should be considered valuable: even [1], one of the most celebrated papers of the whole bandit literature, is substantially built on the idea of applying self-normalized processes [2] to linear bandits.
Finally, it may seem that our paper lacks novelty since the proposed algorithm is a simple modification of existing ones. However, we think that, on the contrary, proposing a simple solution that exploits known mathematics to the fullest is preferable to re-inventing the wheel. In the end, quoting someone who was widely recognized for his novelty, simplicity is the ultimate sophistication.
[1] Y Abbasi-Yadkori, C Szepesvári, D Pál, "Improved algorithms for linear stochastic bandits"
[2] Victor H de la Pena, Tze Leung Lai, Qi-Man Shao, "Self-normalized processes: Limit theory and Statistical Applications"
Since some points of paramount importance were raised by multiple reviewers, we decided to give a common answer in this section.
Computational complexity is a crucial issue in this kind of machine learning application, and one of the most important features of our work is the limited complexity of our algorithm. While the results of the experiments show unmistakable evidence that, in dimension 1, the running time is enormously reduced with respect to the state of the art, questions have been posed about the computational complexity in the multidimensional case.
- Discretization. Although it is true that the optimal discretization has a number of elements that explodes with the dimension, this is an unavoidable consequence of the curse of dimensionality. Every algorithm in the state of the art has to pass through a similar procedure: the UCB-Meta-algorithm needs to discretize the action space into bins and then discretize each bin to solve a linear bandit problem; the four Bayesian optimization baselines require discretizing the action set to find the maximum of the posterior; Zooming needs to start from a discretization too, since otherwise its definition implies solving an NP-hard problem. Table 1 reports the computational complexity of every algorithm with respect to both $T$ and the cardinality of the discretization, showing that our algorithm has the mildest worst-case dependence on the latter.
- Dimension of the linear bandit problem. In multidimensional spaces, the dimension of the auxiliary linear bandit problem built by OB-PE and OB-LinUCB is $N^d$, where $N$ is the order of approximation. Nevertheless, with the optimal choice $N=\mathcal O(T^{\frac{1}{2s}})$, this number of features remains polynomially bounded in $T$ in the feasible regime. Considering that the computational complexity scales polynomially in the number of features, this term remains controlled, as reported in Table 1.
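For concreteness, a back-of-the-envelope computation (illustrative numbers only, assuming the tensor-product feature count $N^d$ with $N \approx T^{1/(2s)}$ as discussed above):

```python
import math

def feature_count(T, s, d):
    """Number of tensor-product features when N is set to roughly T^(1/(2s))."""
    N = math.ceil(T ** (1.0 / (2 * s)))
    return N ** d

# The count stays modest when s is large relative to d, and explodes otherwise:
for T, s, d in [(20_000, 2, 1), (20_000, 4, 2), (100_000, 4, 4), (100_000, 2, 4)]:
    print(f"T={T}, s={s}, d={d} -> {feature_count(T, s, d)} features")
```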
Adaptivity is a very relevant feature of every machine learning algorithm. In our setting, it is interesting to see whether our algorithm is able to perform well when information about either the time horizon $T$ or the order of smoothness $s$ is missing.
- Adaptivity with respect to $T$. Even if our algorithm requires the value of $T$ to compute the number of features, we can achieve a similar regret guarantee, up to a constant factor, by first guessing a small horizon $T_0$, running the algorithm for horizon $T_0$, and then repeatedly doubling the time horizon ($T_0, 2T_0, 4T_0, \dots$) until $T$ is reached. In this way, the algorithm performs similarly while being agnostic of $T$ (see [1]). A minimal sketch of this doubling scheme is given after the reference below.
- Adaptivity with respect to $s$. The other relevant parameter which may be unknown to the agent is $s$. Let us assume the algorithm is run with $s_{\text{miss}} < s_{\text{true}}$. In this case, since an $s_{\text{true}}$-times differentiable function is also $s_{\text{miss}}$-times differentiable, we have the regret bound corresponding to the lower smoothness parameter $s_{\text{miss}}$, so that $$R_T \le \mathcal O\left (T^{\frac{2s_{\text{miss}}+d}{4s_{\text{miss}}}}\right).$$
If instead the algorithm is run with $s_{\text{miss}} > s_{\text{true}}$, we have $$R_T \le \mathcal O\left (T^{\frac{4s_{\text{miss}}-2s_{\text{true}}+d}{4s_{\text{miss}}}}\right);$$ indeed, using a misspecified value $s_{\text{miss}}$ leads to a misspecified value for $N=\mathcal O(T^{\frac{1}{2s_{\text{miss}}}})$. Substituting this value into the regret bound depending on $N$, which is of order $$R_T \le \mathcal O\left (\sqrt{N^d T}+N^{d/2}N^{-s}T\right ),$$ we get exactly the previous bound. Note that in the experiments, $s$ **is already unknown**. In fact, we run the algorithm on all the environments with the same value for $N$ to show that choosing $N=\mathcal O(T^{\frac{1}{2s}})$ is not crucial in practice.
[1] Besson, L., & Kaufmann, E. (2018). What doubling tricks can and can't do for multi-armed bandits. arXiv preprint arXiv:1803.06971.
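A minimal sketch of the doubling scheme described above (our illustration; `run_tuned_for` is a hypothetical callable standing in for one run of OB-PE/OB-LinUCB tuned for a known horizon):

```python
def run_with_unknown_horizon(run_tuned_for, T_true, T0=100):
    """Doubling trick: run the horizon-dependent algorithm with guesses T0, 2*T0, 4*T0, ...

    `run_tuned_for(horizon)` (hypothetical) plays at most `horizon` rounds with N, confidence
    widths, etc. tuned for that horizon, and returns the number of rounds actually played.
    """
    played, horizon = 0, T0
    while played < T_true:
        played += run_tuned_for(min(horizon, T_true - played))
        horizon *= 2  # the overall regret is preserved up to a constant factor (Besson & Kaufmann, 2018)
    return played
```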
We have incorporated the experiments suggested by the reviewers into the paper. The results of these experiments are available in Appendix F.4. Specifically, we have added experiments with OB-PE on the environments of the paper, as well as an experiment in dimension 2.
This is a borderline paper, with one major criticism being that the approach is incremental and lacks novelty. Reading through the rebuttal, I agree with the reviewers. For example, the authors highlight that performance guarantees are intricately tied to the decay properties of the eigenvalues of the kernel function. But this has long been known in the RKHS literature in other contexts, and minimax optimal rates are attained in settings like ridge regression by exploiting the eigenvalue sequences. It is not surprising at all that the very same properties also come into play in the bandit context.
Why not a higher score
It is a borderline paper and a weak accept is also an option. The reject is mainly based on the lack of novelty and the incremental character of the paper.
Why not a lower score
n/a
Reject