Regret measure in continuous time limit for a stochastic Multi-armed bandit problem
Abstract
Reviews and Discussion
This paper studies a class of stochastic multi-armed bandit problems with a risk-sensitive regret measure in a continuous-time limit setting.
Strengths
The paper considers the continuous-time limit of regret measures.
Weaknesses
The presentation is not clear.
The paper's contribution and the significance of the problem are not clearly articulated in the Introduction and the main text.
The English in the paper could benefit from some further refinement or editing to enhance clarity and coherence.
Questions
- What is the main contribution of the paper?
- What exactly is the problem studied?
- Why is studying the continuous-time limit relevant for bandit problems?
- How should we interpret the main result, Theorem 1, and understand its practical relevance?
This paper considers the traditional multi-armed bandit problem with a new risk measure. The authors pass to continuous time through rescaling and use a PDE to find the optimal policy. They also use simulations to verify their results.
Strengths
The way the MAB problem is converted to a PDE problem is interesting and meaningful. The work compares different settings, such as frequentist and Bayesian, which makes it easy to understand the applicability of the method.
Weaknesses
- The writing needs to be improved. There are many typos, which make the paper hard to understand.
- The authors provide no real-world applications that show why this new risk measure is important, which reduces the credibility and impact of the paper.
- The use of MDP seems improper. In your setting, seems to be fixed and only and are changing. However, there is no need to learn the transition kernel as if you choose an action , corresponding will be increased by 1. Then, it reduces to learning the reward function, which is the same as in the traditional MAB literature, and so people usually don't call it an MDP. It's more reasonable to use your framework to consider the case that is varying and say it's an MDP.
- The notation is messy. For example, why only relies on ? And you use a very strong assumption but only hide it in Lemma 1.
- Theorem 1 is unclear. What is zero? Why do you use a bracket but link it to nothing?
- In your numerical study, how do you implement UCB and TS? Do you adjust their regret definitions to your new risk measure? If not, they are not comparable. Otherwise, please describe in detail how you set up the baselines.
Questions
See Weaknesses.
This paper aims to analyze multi-armed bandit problems using differential equations and introduces a new risk measure for the analysis.
Strengths
I am unable to provide a comprehensive scientific review of the paper, and thus I cannot identify specific strengths. Please refer to the weaknesses below.
Weaknesses
The paper has significant issues with presentation. Not only are there numerous grammatical errors, typos, and punctuation mistakes, but many sentences are incomplete and seem disconnected from the surrounding context. Additionally, the writing lacks a clear logical flow, making it difficult to follow the argument.
Furthermore, it appears that the authors have not adhered to the official ICLR style guidelines.
Due to these issues, I am unable to provide a more detailed review.
Questions
Please refer to the weaknesses.
The reviewers struggled to identify the high-level approach of this paper. The ratings are overly harsh in my opinion, but the paper definitely needs a thorough revision to get it into a publishable state. That includes simple things like polishing the text, but it would also help to motivate and highlight the contributions of the paper early on.
Additional Comments on Reviewer Discussion
Unfortunately, no rebuttal was submitted by the authors.
Reject