Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback
Abstract
Reviews and Discussion
The paper studies best-of-both-worlds algorithms for layered finite-horizon MDPs with aggregate bandit feedback. The proposed algorithm is efficient and obtains the following regret bounds in terms of the number of rounds:
For known dynamics:
Adversarial losses - square-root regret in the number of rounds; Stochastic losses - polylogarithmic (gap-dependent) regret.
For unknown dynamics:
Adversarial losses - square-root regret in the number of rounds; Stochastic losses - polylogarithmic (gap-dependent) regret.
All rates are near optimal also in the number of states, the number of actions, and the horizon length.
It also presents regret bounds for the Stochastic Shortest Path setup.
Strengths and Weaknesses
Strengths:
- Well-motivated setting.
- Near-optimal regret bounds from the first BOBW algorithm for MDPs with aggregate bandit feedback.
- Well-written paper.
- Novel optimistic approximation technique (although inspired by previous work on SSP).
- High potential to impact the community by extending the approach to other related setups with aggregate bandit feedback.
Weaknesses:
- The dependence on the MDP parameters (states, actions, and horizon) is not optimal in some of the studied cases.
- The algorithm is not properly stated in the main paper.
Questions
- Can you explain the general algorithm framework and provide pseudo-code?
Limitations
N/A
Justification for Final Rating
After reading the authors' rebuttal, all my concerns have been addressed and I support the paper's acceptance.
Formatting Concerns
N/A
We thank the reviewer for their review. We address their concerns below.
Reviewer’s comment: The dependence on the MDP parameters is not optimal in some of the studied cases.
Author’s response: While we agree that the dependence on these parameters is not optimal, the same holds for prior works in the standard feedback setting. Improving these bounds in both the standard and aggregate feedback settings remains an important open question.
Reviewer’s question: Can you explain the general algorithm framework and provide pseudo-code?
Author’s response: Our algorithmic framework has the following main components:
- Building the confidence set of the true state transitions.
- Building an optimistic loss estimator.
- Determining a policy using FTRL with a suitable regularizer that combines both Tsallis-INF and log-barrier components.
- Generating a trajectory using the computed policy and updating the estimates.
The main novel aspect of our algorithmic framework is the loss estimator, which eliminates the need for the loss-shifting techniques used in prior work and avoids the additional scaling of the bonus term. In addition, our approach generalizes the estimator of Maiti et al. (2025), originally developed for shortest paths in DAGs, to the more general setting of MDPs with aggregate feedback. For the detailed pseudo-code of our algorithm, we refer the reviewer to Algorithm 1 on Page 41 of the supplementary material.
Maiti et al. (2025). Efficient near-optimal algorithm for online shortest paths in directed acyclic graphs with bandit feedback against adaptive adversaries.
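For intuition only, the following is a minimal, self-contained sketch of the FTRL-over-occupancy-measures loop for a toy layered MDP with known transitions. It is a simplification rather than our actual algorithm: the ridge-regression (linear-bandit-style) loss estimator is a stand-in for our optimistic estimator, and the regularizer weights, toy dimensions, and helper names are illustrative placeholders.

```python
# For intuition only: toy FTRL loop over occupancy measures (not the full algorithm).
import numpy as np
from scipy.optimize import minimize

H, S, A = 3, 2, 2                                   # layers, states per layer, actions (toy sizes)
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(H, S, A))       # P[h, s, a, s']: transition from layer h to h+1
dim = H * S * A

def flow_constraints(q_flat):
    """Equality constraints defining the occupancy-measure polytope (known transitions)."""
    q = q_flat.reshape(H, S, A)
    cons = [q[0].sum() - 1.0]                        # all probability mass starts in layer 0
    for h in range(H - 1):
        inflow = np.einsum('sa,sat->t', q[h], P[h])  # mass flowing into each layer-(h+1) state
        cons.extend(inflow - q[h + 1].sum(axis=1))
    return np.array(cons)

def ftrl_step(cum_loss_est, eta=0.1, gamma=1e-3):
    """FTRL: minimize <q, cumulative loss estimate> + (Tsallis-1/2 + log-barrier)/eta over the polytope."""
    def obj(q_flat):
        q = np.clip(q_flat, 1e-8, None)
        reg = -2.0 * np.sqrt(q).sum() + gamma * (-np.log(q)).sum()
        return q @ cum_loss_est + reg / eta
    q0 = np.full(dim, 1.0 / (S * A))
    res = minimize(obj, q0, bounds=[(1e-8, 1.0)] * dim,
                   constraints={'type': 'eq', 'fun': flow_constraints}, method='SLSQP')
    return res.x.reshape(H, S, A)

def rollout(q, true_loss):
    """Play the policy induced by q; return the visit-indicator vector and the aggregate loss only."""
    visit, s = np.zeros((H, S, A)), 0
    for h in range(H):
        pi = q[h, s] / max(q[h, s].sum(), 1e-12)
        a = rng.choice(A, p=pi / pi.sum())
        visit[h, s, a] = 1.0
        s = rng.choice(S, p=P[h, s, a])
    x = visit.ravel()
    return x, float(x @ true_loss)                   # learner observes only the trajectory-level loss

true_loss = rng.uniform(0, 1, dim) / H               # per-(layer,state,action) losses; aggregate in [0, 1]
Sigma, b, cum_loss_est = np.eye(dim), np.zeros(dim), np.zeros(dim)
for k in range(50):
    q = ftrl_step(cum_loss_est)
    x, agg_loss = rollout(q, true_loss)
    Sigma, b = Sigma + np.outer(x, x), b + agg_loss * x
    cum_loss_est = (k + 1) * np.linalg.solve(Sigma, b)   # crude stand-in for the optimistic estimator
```

In this sketch, the policy played in each episode is obtained by normalizing the occupancy measure within each state, and only the aggregate loss of the sampled trajectory is used to update the estimator; the actual algorithm replaces the ridge estimate with the optimistic estimator described above and, in the unknown-transition case, optimizes over a confidence set of occupancy measures.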
Reviewer’s comment: The algorithm is not properly stated in the main paper.
Author’s response: In the revised version, we will prioritize including the algorithm in the main body, space permitting, and adjust the surrounding discussion accordingly.
I thank the authors for their rebuttal and I will keep my score.
This paper presents the first study of best-of-both-worlds algorithms for episodic tabular MDPs with aggregate bandit feedback, i.e., a setting where the learner only observes the cumulative loss incurred in each episode instead of the individual loss at each state-action pair of the trajectory. The authors define a loss estimator inspired by online shortest path problems and use it within FTRL algorithms over occupancy measures. They treat both the case of known transitions and that of unknown transitions.
Strengths and Weaknesses
This is an interesting paper for the following reasons:
- The paper is well-written and clearly structured.
- Adapting the loss estimator from online shortest path problems to episodic MDPs with aggregate feedback is novel and enables the development of BOBW algorithms, and potentially other algorithmic approaches as well.
- Rather than simply combining two existing results (the loss estimator from Maiti et al. and the FTRL BOBW approaches from Jin et al.), the authors also have to address the additional challenge of unknown transition probabilities. Ensuring the optimism of the loss estimator in this setting seems non-trivial, and the approach they use in Section 3.3 could have broader applications.
Aside from the limitations already outlined in the conclusion as directions for future work (generalizing to policy-optimization algorithms to reduce the computational complexity), I don't see any major weaknesses to highlight.
Questions
I have a question regarding the final result:
- What specific aspect of the loss estimator allows one to avoid scaling the bonus term by an additional factor? I ask because, in episodic MDPs, methods that incorporate bonus terms are typically known to incur an extra factor in the regret bound, compared to approaches based on confidence bounds over occupancy measures, because of this scaling factor in the bonus definition. Does this suggest that in the standard episodic MDP setting a similar loss estimator could be used to eliminate this additional term in the regret bound while keeping the same dependencies on the other MDP parameters?
Limitations
Yes.
Justification for Final Rating
After reading the other reviews and responses, I will keep my original positive score. The authors have addressed all of my questions, and given the strengths highlighted above, I believe this is interesting work.
Formatting Concerns
The paper seems to follow the formatting instructions.
We thank the reviewer for their review. We address their concerns below.
Reviewer’s question: What specific aspect of the loss estimator allows one to avoid scaling the bonus term by an additional factor? [...] Does this suggest that in the standard episodic MDP setting a similar loss estimator could be used to eliminate this additional term in the regret bound while keeping the same dependencies on the other MDP parameters?
Author’s response: In Eq. (9) on page 7, the presence of the additional terms both in the numerator of the fraction and after the fraction eliminates the need for the loss-shifting techniques used in prior work, thereby avoiding the additional scaling of the bonus term. Our bounds also improve upon the instance-dependent regret guarantees of Jin et al. (2021), who considered the standard episodic MDP setting.
Jin et al. 2021. The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition
I thank the authors for addressing my question. After reading the other reviews and responses, I’ve decided to maintain my score as an accept.
The paper studies online learning in finite-horizon episodic MDPs under aggregate bandit feedback, which means that in each episode the agent only observes the aggregate loss across the entire episode. Prior work has provided regret bounds for the worst-case version of this setting, where the loss function in each episode is chosen adversarially. In the stochastic version of this setting, the loss function in each episode is chosen i.i.d. This paper proposes a “best of both worlds” (BOBW) algorithm which yields square-root regret (in the number of episodes) in the worst-case setting and polylogarithmic regret in the stochastic setting. They also provide nearly matching lower bounds. Lastly, they provide some regret bounds for online shortest path as a byproduct of the primary analysis.
Strengths and Weaknesses
Strengths
The paper does a reasonable job of motivating the problem. The authors' work fills a gap in the existing literature and they provide a clear comparison (I especially appreciated Tables 1-3). The results all appear sound and well-justified.
Weaknesses
I did not find any significant issues with the paper. I think the presentation could be improved somewhat: the paper dives into pretty heavy math on page 5 without a lot of intuition. A proof roadmap or a definitions roadmap might be helpful here.
The primary reason I am not awarding a higher score is simply that nothing in particular stands out to me about this paper. The results are fine but not exceptional, the techniques are fine but not exceptional. The paper fills a very specific niche that I think has limited impact. To the extent possible, I would encourage the authors to justify why their paper is not “just another MDP online learning paper”.
I realize this may be frustrating to the authors since this is a very subjective assessment, but I cannot justify giving a higher score because I do not think the paper has “high impact on at least one sub-area of AI or moderate-to-high impact on more than one area of AI”. Overall, I still lean towards acceptance though.
Questions
If the authors believe that there is a reason I am underestimating the potential impact of their work, I am open to changing my mind. However, since this is ultimately a very subjective assessment, I think this is unlikely.
Minor comments:
- “TC” is not defined in Tables 1-3. I understand that it is defined later, but I think everything that is used in the table should be defined in the table
- Is there no way to simplify the expression on line 92?
- Lines 98-107: Why not instead assume that the aggregate losses are in [0,L]? That seems simpler
- Line 136: What is this expectation over? Isn’t the quantity in question a constant given the conditioning?
Limitations
The “limitations discussion” is essentially just a normal conclusion section where the authors discuss technical open questions. I suppose this technically fits the requirements for a limitations discussion, but to me it misses the spirit of the requirement: if the expectation was simply to write a normal conclusion section, there would be no need to explicitly include the requirement for a limitations discussion in the submission guidelines.
Justification for Final Rating
My assessment remains the same: the paper has no major issues, but I do not think it has “high impact on at least one sub-area of AI or moderate-to-high impact on more than one area of AI” to justify an Accept rating.
Formatting Concerns
None
We thank the reviewer for their review. We address their concerns below.
Reviewer’s comment: If the authors believe that there is a reason I am underestimating the potential impact of their work, I am open to changing my mind. However, since this is ultimately a very subjective assessment, I think this is unlikely.
Author’s response: We thank the reviewer for their thoughtful assessment and constructive feedback. We acknowledge that parts of the presentation, particularly the transition into the technical sections, could benefit from more intuition and a high-level roadmap. We will address this in the final version.
Regarding the impact of our work, we address a fundamental gap in online learning for MDPs with aggregate bandit feedback, where the learner observes only the total reward over a trajectory. This setting is strictly more challenging than the standard one, where feedback is observed at each step, and it also arises naturally in many practical applications where intermediate rewards are inaccessible or costly to obtain. Existing methods either rely on policy-optimization approaches or linear bandit reductions, neither of which achieve best-of-both-worlds (BOBW) guarantees.
Our technical contribution is twofold:
- Novel estimator and first FTRL algorithm: We build on the recent estimator of Maiti et al. (2025) for shortest paths in DAGs and generalize it in a non-trivial way to design the first FTRL algorithm for aggregate-feedback MDPs. This directly leads to the first BOBW guarantees in this setting.
- Simplified analysis and improved bounds: Unlike prior BOBW work in standard MDPs, which required complex loss-shifting techniques, our estimator avoids these techniques and yields a cleaner, more general analysis while improving the regret bounds.
We believe that introducing FTRL to the aggregate-feedback setting, achieving BOBW where prior methods failed, and simplifying the analysis via a more general estimator together lay the groundwork for future advances in online learning with partial feedback.
Reviewer’s comment: “TC” is not defined in Tables 1-3. I understand that it is defined later, but I think everything that is used in the table should be defined in the table.
Author’s response: We apologize for the confusion. “TC” refers to time complexity, and a checkmark indicates that an efficient implementation is possible. We will clarify this in the table for better readability.
Reviewer’s comment: Is there no way to simplify the expression on line 92?
Author’s response: Currently, we are not aware of any way to simplify the expression. Improving upon this remains an important open question.
Reviewer’s comment: Lines 98–107: Why not instead assume that the aggregate losses are in [0,L]? That seems simpler.
Author’s response: We assumed that the aggregate losses lie in [0,1] purely for simplicity of presentation. To compare with the [0,L] case, one can simply scale the regret bounds by L. Moreover, as explained in Remark 1 of the supplementary material, choosing aggregate losses in [0,1] (or [0,L] when scaled) corresponds to a more general setting than assuming the per-layer losses are individually bounded.
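To spell out the scaling step (with illustrative notation $c_k$, $\tilde{c}_k$, $\pi_k$, $R(K)$, not the paper's own):

```latex
% Rescaling aggregate losses from [0, L] to [0, 1] and back (illustrative notation).
% If episode k's aggregate loss satisfies c_k \in [0, L], set \tilde{c}_k = c_k / L \in [0, 1].
% A regret bound R(K) proved for losses in [0, 1] then transfers by linearity of the regret:
\[
  \mathrm{Reg}_K
  = \sum_{k=1}^{K} c_k(\pi_k) - \min_{\pi} \sum_{k=1}^{K} c_k(\pi)
  = L \Bigl( \sum_{k=1}^{K} \tilde{c}_k(\pi_k) - \min_{\pi} \sum_{k=1}^{K} \tilde{c}_k(\pi) \Bigr)
  \le L \cdot R(K).
\]
```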
Reviewer’s comment: Line 136: What is this expectation over? Isn’t the quantity in question a constant given the conditioning?
Author’s response: In the stochastic setting, this quantity is a noisy reward, and the expectation is taken over the randomness in the noise.
Thanks for the response. My assessment remains the same: the paper has no major issues, but I do not think it has “high impact on at least one sub-area of AI or moderate-to-high impact on more than one area of AI”.
The paper proposes the first best-of-both-worlds (BOBW) algorithms for episodic MDPs with aggregate bandit feedback, where only trajectory-level losses are observed. The algorithms achieve polylogarithmic regret in stochastic settings and square-root regret in adversarial ones, both for known and unknown transitions. The algorithm works under the assumption that the state space is partitioned into a set of layers, and transitions from a layer are only allowed to the next layer.
Strengths and Weaknesses
Strengths
- Although the paper is not in my area of expertise, I believe the paper is overall clearly written and the contribution seems significant.
Weaknesses
- Lack of experiments. Even for a theoretical paper, I believe sanity-check experiments that test the main algorithmic novelties are important.
- Clarity could be improved by moving the algorithm definition to the main text. In the current version, the main text focuses on explaining the ideas behind building the algorithm, but without clearly presenting the algorithm itself.
Questions
Why is it necessary to assume that the MDP is layered?
Limitations
Assumption of layered MDPs and lack of experimental evidence/sanity checks.
Justification for Final Rating
As I said in my review, although the paper is not in my area of expertise, I believe the paper is overall clearly written and the contribution seems significant.
Formatting Concerns
None
We thank the reviewer for their review. We address their concerns below.
Reviewer’s question: Why is it necessary to assume that the MDP is layered?
Author’s response: Note that this assumption introduces no loss of generality: any finite-horizon MDP can be transformed into a layered one by replicating the state space across layers. We assume a layered MDP primarily to simplify notation and analysis. This is standard in the literature, for instance, in works like (Jin et al., 2020, Jin et al., 2021), and is often referred to as a loop-free stochastic shortest path problem.
Jin et al. 2020. Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition
Jin et al. 2021. The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition
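For concreteness, the layering construction can be written as follows (the notation $S'$, $P'$, $P_h$ is illustrative):

```latex
% From a finite-horizon MDP (state space S, actions A, transitions P_h, horizon H)
% to a layered, loop-free MDP: replicate each state across the layers and tag it
% with its time step.
\[
  S' = S \times \{0, 1, \dots, H\},
  \qquad
  P'\bigl((s', h+1) \mid (s, h), a\bigr) = P_h(s' \mid s, a).
\]
% Every transition now goes from layer h to layer h+1, so no state can be revisited,
% and policies/trajectories in the two MDPs are in one-to-one correspondence.
```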
Reviewer’s comment: Clarity could be improved by moving the algorithm definition to the main text. In the current version, the main text focuses on explaining the ideas behind building the algorithm, but without clearly presenting the algorithm itself.
Author’s response: We thank the reviewer for this helpful suggestion. We agree that presenting the full algorithm in the main text would improve clarity. In the revised version, we will prioritize including the algorithm in the main body, space permitting, and adjust the surrounding discussion accordingly.
Thank you very much for the clarifications! After reading the other reviewers' responses, I decided to maintain my rating.
All reviewers gave the paper positive evaluations. It is mostly well written and its theoretical results are strong enough for this venue. Given these strengths, I recommend that it be accepted.