No-regret Learning with Revealed Transitions in Adversarial Markov Decision Processes
Abstract
Reviews and Discussion
Summary: This paper considers the setting where the rewards and transitions are adversarial, and the transition kernel is revealed to the agent at the end of each round. In this new setting, the authors introduce the concepts of smoothed MDPs and smoothed regret. In particular, the smoothed MDP is defined via a transition kernel that aggregates the previously experienced ones. The authors propose an OMD-based algorithm, called SOMD, that achieves a smoothed regret bound under full-information feedback. Further, SOMD is extended to handle bandit feedback.
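For concreteness, one natural instance of such an aggregation (an illustrative sketch only; the paper's exact smoothing function may differ) is the running average of the kernels revealed in earlier rounds,

$$\bar{P}_t(s' \mid s, a) \;=\; \frac{1}{t-1} \sum_{k=1}^{t-1} P_k(s' \mid s, a),$$

and the smoothed regret is then measured in the MDP induced by $\bar{P}_t$ rather than by the true adversarial kernel $P_t$.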
Strengths
- This paper introduces interesting concepts of smoothed MDP and smoothed regret. The new transition model is particularly interesting since it subsumes two models studied in prior works as special cases. I appreciate the authors’ efforts to highlight their connections.
- The authors develop new algorithms for the smoothed MDP with full-information and bandit feedback. Theoretical guarantees are provided w.r.t. the smoothed regret, and other regret notions are obtained by translating the smoothed regret bounds.
- The description of algorithms is clear and easy to follow. The related work is also well categorized.
Weaknesses
- That the transition is revealed at every round is a pretty strong assumption. This assumption significantly simplifies the AMDP problem, since the main challenge is to deal with an unknown and adversarial transition model. In this case, the algorithm does not handle the varying transitions but instead exploits the revealed transitions.
- The paper does not provide a clear motivation for considering revealed transitions. For example, in the abstract, there seems to be no direct link between the first two sentences. While the problem of adversarial transitions and rewards is central to learning AMDPs, it is not immediately clear why this motivates the study of a weaker model with revealed transitions.
- The algorithmic contribution seems somewhat limited, as similar ideas have been explored in previous work, particularly in the context of [1].
[1] Mirror Descent Meets Fixed Share (and feels no regret)
Questions
- Are there any motivating examples for considering the revealed-transition model?
- It looks like the main difficulty in acquiring any meaningful bound when transitions are not revealed is that the function is not specified. It could be interesting to try to obtain some bound by letting the function class (of ) be as large as possible while the transitions remain unknown.
This paper considers MDPs where rewards and transitions are both adversarial. While it is statistically or computationally impossible to achieve sub-linear regret in general, this paper considers the case where we are given full-information feedback on the adversarial transitions and the transitions have low variation w.r.t. some reference transition model.
By pretending that the transition model is the average of all historical transitions and performing an EXP3-style update on the occupancy measure space induced by this "smoothed" transition model, the authors obtain algorithms ensuring regret in both the full-information and bandit feedback models (for rewards).
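To fix ideas, a minimal sketch of this mirror-descent step on occupancy measures (the symbols $q_t$, $\hat{\ell}_t$, $\eta$, and $\Delta(\bar{P}_t)$ below are assumed notation, not necessarily the paper's) is

$$q_{t+1} \;=\; \operatorname*{arg\,min}_{q \in \Delta(\bar{P}_{t+1})} \; \eta \, \langle q, \hat{\ell}_t \rangle + D_{\Phi}(q \,\|\, q_t),$$

where $\Delta(\bar{P})$ denotes the set of occupancy measures consistent with the smoothed kernel $\bar{P}$, $\hat{\ell}_t$ is the (possibly importance-weighted) loss estimate, and $D_{\Phi}$ is a Bregman divergence; with the negative-entropy regularizer this reduces to an EXP3-style multiplicative update projected onto $\Delta(\bar{P}_{t+1})$.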
Strengths
- The idea of utilizing feedback on transitions to tackle the time-varying transition model is intuitive and well-explained.
- The results enjoy a nice dependency on both and .
Weaknesses
While, as I said in the Strengths, the idea of utilizing feedback on transitions to tackle the time-varying transition model is intuitive and well-explained, it also makes me unsurprised about the algorithms and the results. Given that , it is straightforward that the average transitions would converge to the reference transition, thus allowing a good approximation.
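As a one-line check of this intuition (under an assumed bounded-variation condition, not necessarily the paper's exact measure): if $\sum_{k=1}^{T} \|P_k - P^{\mathrm{ref}}\|_1 \le C$, then the running average $\bar{P}_t = \frac{1}{t-1}\sum_{k<t} P_k$ satisfies

$$\big\| \bar{P}_t - P^{\mathrm{ref}} \big\|_1 \;\le\; \frac{1}{t-1} \sum_{k<t} \big\| P_k - P^{\mathrm{ref}} \big\|_1 \;\le\; \frac{C}{t-1},$$

which vanishes as $t$ grows, so the smoothed kernel indeed tracks the reference model.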
Besides the smoothed OMD & regret part, the other techniques (like occupancy measures or implicit exploration) are pretty standard for adversarial MDPs. As for the smoothed part -- because of the well-established research on occupancy measures -- it is also not hard to translate the smoothed regret to the standard regret. I also do not see how this idea can be generalized to more challenging problems where the transition feedback is imperfect or the variation metric is different, or to other RL theory problems.
Therefore, at least based on the current main text, this paper lacks the technical contribution needed to be of general interest. The problem setup, albeit indeed new and different from previous ones, is also insufficient to make this paper stand out. This is why I vote for rejection. If I am missing something, please let me know and I will be happy to re-evaluate this paper.
Questions
Please refer to the Weakness part.
This paper studies learning MDPs with adversarially chosen losses and transitions. The loss feedback can be either full-information or bandit, but the transition is revealed at the end of each round. The authors propose the smoothed transition error as a more general measure of the changing transitions and derive a sublinear regret bound with an additive term of . A new regret decomposition and a new technique, smoothed OMD, are proposed to achieve this.
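To make the measure concrete, one plausible reading of such a smoothed transition error (an assumed form, used only for illustration; the paper's exact norm and normalization may differ) is

$$C^{\mathrm{sm}} \;=\; \sum_{t=1}^{T} \max_{s,a} \big\| P_t(\cdot \mid s,a) - \sigma_t(P_1,\dots,P_{t-1})(\cdot \mid s,a) \big\|_1,$$

where $\sigma_t$ denotes the smoothing function (the running average in the simplest case); the regret bound is then sublinear in $T$ up to this additive term.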
Strengths
- This paper studies an important online learning problem where both reward and transition can be adversarially chosen.
- This paper proposes a new measure of changing transitions and shows that it is more general than previous measures.
- This paper proposes a new technique, smoothed OMD, to help the analysis.
Weaknesses
- Compared with [Jin et al., 2023], this paper requires the transition to be revealed at the end of every round, which is a strong assumption.
- While the paper introduces an additional assumption and presents a more general framework, it appears to offer limited new learnability results compared to [Jin et al., 2023]. The current interesting regret results are all based on the average smoothing function, which leads to almost the same regret compared with [Jin et al., 2023] given Example 3.2. Although Theorem 4.1 provides a general regret bound, it is unclear whether this general theorem can lead to interesting sublinear regret with refined transition error measure compared with [Jin et al., 2023].
Questions
- If we choose other smoothing functions beyond the average one, is it possible to get a regret bound with a better additive transition error measure compared with [Jin et al., 2023]? For example, is it possible to achieve this with the smoothing function in Example 3.1?
- Could the authors provide more technical insights about the reason for choosing smoothed regularization at Line 4 of Algorithm 1?
This paper studies the problem of learning an adversarial MDP with revealed transitions. The authors propose a notion of smoothed MDP based on a sequence of generic smoothing functions , and devise Smoothed Online Mirror Descent (SOMD) to reach regret, where describes the variation of the transition model measured by .
Strengths
The SOMD algorithm can work with different smoothing functions, which makes it more flexible than the classical OMD algorithm.
Weaknesses
- The overall contribution is incremental from the perspective of RL theory, given that (Jin et al., 2023) had already established a similar regret bound without transition information.
- It is not clear how to choose the smoothing function efficiently to minimize the regret, even assuming the transition is revealed at the end of each episode.
Questions
I did not go through the details, but it seems weird if is a function of . If so, we can let by choosing . So could I guess that might be a function of ?
We thank the reviewers for the time spent reviewing our paper. We have decided to withdraw our submission and strengthen our work.
Thank you again,
the Authors.