Regret measure in continuous time limit for a stochastic Multi-armed bandit problem

Sabrine Chebbi, Sofien Dhouib, Setareh Maghsudi

OpenReview PDF

Submitted: 2024-09-25, Updated: 2025-02-05

Abstract

Keywords

Stochastic multi-armed bandit, Risk-sensitive regret, Hamilton-Jacobi-Bellman equation, Continuous time-limit

Reviews and Discussion

Review

Rating: 3, Confidence: 3, 2024-10-26

This paper studies a class of stochastic multi-armed bandit problems with a risk-sensitive regret measure in a continuous-time limit setting.

Strengths

The paper considers the continuous-time limit of regret measures.

Weaknesses

The presentation is not clear.

The paper's contribution and the significance of the problem are not clearly articulated in the Introduction and the main text.

The English in the paper could benefit from some further refinement or editing to enhance clarity and coherence.

Questions

What is the main contribution of the paper?
What exactly is the problem studied?
Why is studying the continuous-time limit relevant for bandit problems?
How should we interpret the main result, Theorem 1, and understand its practical relevance?

Review

Rating: 3, Confidence: 4, 2024-10-26

This paper considers the traditional multi-armed bandit problem with a new risk measure. The authors pass to continuous time through rescaling and use a PDE to find the optimal policy. They also use simulations to verify their results.

Strengths

The way the MAB problem is converted to a PDE problem is interesting and meaningful. The work compares different concepts, such as the frequentist and Bayesian settings, which makes it easy to understand the applicability of the method.

Weaknesses

The writing needs to be improved. There are many typos, which make the paper hard to understand.
The authors provide no real-world application showing why this new risk measure is important, which reduces the credibility and impact of the paper.
The usage of MDP seems improper. In your setting, $\nu$ seems to be fixed and only $s$ and $q$ are changing. However, there is no need to learn the transition kernel: if you choose an action $a$, the corresponding $q$ is increased by 1. The problem then reduces to learning the reward function, which is the same as in the traditional MAB literature, so people usually don't call it an MDP. It would be more reasonable to use your framework for the case where $\nu$ is varying and call that an MDP.
The notation is messy. For example, why does $V_{i+1}$ rely only on $R_i$? You also use a very strong assumption but hide it in Lemma 1.
Theorem 1 is unclear. What is zero? Why is there a bracket that is not linked to anything?
In your numerical study, how do you implement UCB and TS? Do you adjust their definitions of regret to your new risk measure? If not, they are not comparable. Otherwise, it would be better to describe in detail how you set up the baselines.
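
The point about the transition kernel can be illustrated with a minimal sketch. It assumes, hypothetically, that $s$ holds per-arm cumulative rewards, that $q$ holds per-arm pull counts, and that rewards are Bernoulli; the names `step` and `bernoulli_sampler` are illustrative and are not taken from the paper.

```python
import random


def step(s, q, a, sample_reward):
    """One transition of the (assumed) bandit state (s, q) after pulling arm a."""
    reward = sample_reward(a)
    s_next = list(s)
    q_next = list(q)
    s_next[a] += reward  # the only random part: the reward drawn from arm a
    q_next[a] += 1       # deterministic: the count of the chosen arm grows by 1
    return s_next, q_next, reward


def bernoulli_sampler(means, rng):
    """Return a function that draws a 0/1 reward for arm a (assumed reward model)."""
    def sample(a):
        return 1.0 if rng.random() < means[a] else 0.0
    return sample


rng = random.Random(0)
sample = bernoulli_sampler([0.3, 0.7], rng)

s, q = [0.0, 0.0], [0, 0]
for a in [0, 1, 1, 0, 1]:
    s, q, _ = step(s, q, a, sample)

# The counts are fully determined by the chosen actions, whatever the rewards were.
print(q)
```

Under these assumptions, the evolution of $q$ carries no information to be learned; only the reward distribution behind $s$ is unknown, which is the standard bandit setting.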

Questions

See Weaknesses.

Review

Rating: 1, Confidence: 1, 2024-10-29

This paper aims to analyze multi-armed bandit problems using differential equations and introduces a new risk measure for the analysis.

Strengths

I am unable to provide a comprehensive scientific review of the paper, and thus I cannot identify specific strengths. Please refer to the weaknesses below.

Weaknesses

The paper has significant issues with presentation. Not only are there numerous grammatical errors, typos, and punctuation mistakes, but many sentences are incomplete and seem disconnected from the surrounding context. Additionally, the writing lacks a clear logical flow, making it difficult to follow the argument.

Furthermore, it appears that the authors have not adhered to the official ICLR style guidelines.

Due to these issues, I am unable to provide a more detailed review.

Questions

Please refer to the weaknesses.

AC Meta-Review

2024-12-17

The reviewers struggled to identify the high-level approach of this paper. The ratings are overly harsh in my opinion, but the paper definitely needs a thorough revision to get it into a publishable state. That includes simple things like polishing the text, but it would also help to motivate and highlight the contributions of the paper early on.

Additional Comments on Reviewer Discussion

Unfortunately, no rebuttal was submitted by the authors.

Final Decision: Reject

2025-01-22

Reject