PaperHub
Overall score: 4.9/10
Poster · 4 reviewers
Individual ratings: 4, 2, 3, 2 (min 2, max 4, std. dev. 0.8)
ICML 2025

Policy Optimization for CMDPs with Bandit Feedback: Learning Stochastic and Adversarial Constraints

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024) proposed the first best-of-both-worlds algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases. This algorithm suffers from two major drawbacks. First, it only works under full feedback, which severely limits its applicability in practice. Moreover, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, a highly inefficient task. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Specifically, when the constraints are stochastic, the algorithm achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation, while, when they are adversarial, it attains $\widetilde{\mathcal{O}}(\sqrt{T})$ constraint violation and a tight fraction of the optimal reward. Moreover, our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.
Keywords
CMDPs · Online Learning

Reviews and Discussion

Review (Rating: 4)

The authors study constrained Markov decision processes (CMDPs) and they provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback.

Questions for the Authors

Page 1: Can you please connect the study of constraint violation to the classical RL literature?

Page 2: Can you please elaborate on the notion of episode-dependent learning rate?

Page 3: In Algorithm 1, what are the inputs and the outputs?

Page 4: Can you please further explain Condition 2.5?

Page 5: Can you please explain Equation (2)?

Page 6: What is the output of Algorithm 3?

Page 7: Can you please elaborate on Lines 334--336?

Page 8: Why are you defining $\mathrm{OPT}^{\mathcal{w}}$ in this way?

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Yes, to some extent.

Experimental Design and Analysis

Yes, to some extent.

Supplementary Material

No.

Relation to Prior Work

The main novelty is that the authors provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback.

Missing Important References

No.

Other Strengths and Weaknesses

N/A.

Other Comments or Suggestions

N/A.

Author Response

We thank the Reviewer for the positive evaluation.

[1]

While in the classical RL literature the only goal is to maximize a given reward function, in the constrained RL literature the problem becomes two-fold. Specifically, the agent has to additionally fulfill $m$ constraints while maximizing the reward function. This double objective strongly limits the ability to explore the decision space.

[2]

By “episode-dependent learning rate” we mean a learning rate that changes across episodes. This is required to attain a linear dependence on the payoffs’ range in the regret bound of the primal algorithm. Notice that, while attaining a linear dependence on the payoffs’ range is trivial when the range is known, this is not the case when it is unknown.
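As a concrete illustration of what such an episode-dependent learning rate could look like, the sketch below adapts the step size to the largest loss range observed so far; the class name and the specific schedule are assumptions made for exposition, not the schedule used in the paper.

```python
# Illustrative sketch only: an online learning rate that adapts to the largest
# payoff range observed so far, so that the resulting regret bound scales with
# that range linearly rather than quadratically. The schedule below is an
# assumption for exposition, not the paper's exact rule.
import math

class AdaptiveLearningRate:
    def __init__(self, base: float = 1.0):
        self.base = base
        self.max_range = 1.0   # running estimate of the payoff range
        self.episode = 0

    def update(self, observed_loss_range: float) -> float:
        """Return the learning rate to use in the next episode."""
        self.episode += 1
        self.max_range = max(self.max_range, observed_loss_range)
        # Larger observed ranges and later episodes both shrink the step size.
        return self.base / (self.max_range * math.sqrt(self.episode))
```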

[3]

Algorithm 1 does not have any input/output since it simply shows the interaction between the learner and the environment.

[4]

Condition 2.5 is a condition on the Slater parameter $\rho$ of the offline problem. As is standard for primal-dual methods, when $\rho$ is large enough, it is possible to achieve better regret guarantees, since the Lagrangian space can be bounded by a quantity which is “almost” independent of $T$.

[5]

We apologize to the Reviewer: Equation (2) contains a typo. The corrected version is $\ell_t(x_h,a_h)=\Gamma_t+\sum_{i=1}^m \lambda_{t,i}\, g_{t,i}(x_h,a_h)-r_t(x_h,a_h)$, which is the definition of the Lagrangian losses fed to the primal algorithm.
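For concreteness, here is a minimal sketch of how this corrected Lagrangian loss could be evaluated on the state-action pairs visited in an episode; all function and variable names are hypothetical, not taken from the paper's pseudocode.

```python
import numpy as np

def lagrangian_loss(r_t, g_t, lam_t, gamma_t, path):
    """Sketch of ell_t(x_h, a_h) = Gamma_t + sum_i lambda_{t,i} g_{t,i}(x_h, a_h) - r_t(x_h, a_h),
    evaluated only on the state-action pairs actually traversed in episode t.
    r_t, g_t: dict-like lookups for the reward and the m constraint costs;
    lam_t: length-m vector of Lagrange multipliers; gamma_t: scalar shift.
    All names are illustrative assumptions, not the paper's code."""
    losses = []
    for (x_h, a_h) in path:
        penalty = float(np.dot(lam_t, g_t[(x_h, a_h)]))  # sum_i lambda_{t,i} g_{t,i}(x_h, a_h)
        losses.append(gamma_t + penalty - r_t[(x_h, a_h)])
    return losses
```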

[6]

Algorithm 3 outputs a policy $\pi_{t+1}$, which is the policy that the primal algorithm suggests playing in the next episode, based on the previously observed losses.

[7]

One of the main challenges of our work is to build an algorithm that does not require the Slater parameter $\rho$ as input. Notice that, when $\rho$ is known, it suffices to bound the Lagrange multipliers’ decision space to $H/\rho$ in order to obtain best-of-both-worlds guarantees. Differently, when $\rho$ is not known, it is necessary to resort to different techniques to show that the Lagrange multipliers remain bounded during the learning dynamics. We tackle this aspect by employing the no-interval-regret property of the primal and dual regret minimizers.

[8]

$\mathrm{OPT}^{\mathcal{w}}$ is an alternative baseline for the setting with adversarial constraints. This baseline is computed with respect to the optimal fixed policy that satisfies the constraints at each episode (and not only on average). The introduction of this weaker baseline is motivated by the fact that, when competing against $\mathrm{OPT}^{\mathcal{w}}$, it is possible to guarantee sublinear regret and constraint violation simultaneously.
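One possible way to write this weaker baseline (the constraint convention $g_{t,i}(\pi)\le 0$ and the notation below are assumptions made for illustration, not the paper's exact definition):
\[
\mathrm{OPT}^{\mathcal{w}} \;=\; \max_{\pi}\; \sum_{t=1}^{T} r_t(\pi)
\quad \text{s.t.} \quad g_{t,i}(\pi) \le 0 \quad \forall t \in [T],\ \forall i \in [m],
\]
in contrast with the standard baseline, which only requires $\sum_{t=1}^{T} g_{t,i}(\pi) \le 0$ for each $i$, i.e., satisfaction on average.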

Review (Rating: 2)

This paper studies online constrained Markov decision processes with rewards and constraints that can be either stochastic or adversarial. It proposes the first algorithm for this setting with bandit feedback. Moreover, instead of solving a problem over occupancy measures, the method is based on a primal-dual policy optimization approach. The algorithm requires no knowledge of the Slater parameter. For this, the authors need a primal regret minimizer that satisfies the no-interval-regret property, which they achieve by constructing a policy-based algorithm with a fixed-share update.

Questions for the Authors

I would appreciate hearing the authors' responses to the concerns regarding the weaknesses of the paper. I also have some questions to better understand their approach:

  • The authors state that they do not assume Slater's condition (the existence of a strictly feasible solution). However, does requiring Condition 2.5 to hold not implicitly imply the existence of such a solution? I would appreciate it if the authors could clarify this point.

  • One of the authors' justifications for the use of policy optimization is that mixing the occupancy measure produced by an occupancy-measure algorithm with the uniform one (as in the fixed-share update) would not work. However, would it be feasible to instead mix the policy induced by the occupancy measure with the uniform policy and then use the occupancy measure induced by this mixed policy? This could allow one to achieve the same results (dealing with stochastic and adversarial rewards and constraints with bandit feedback) with an occupancy-measure approach as well. I would be interested in the authors' perspective on this point.

Claims and Evidence

All theoretical claims are followed by proofs in the appendix or the main paper.

Methods and Evaluation Criteria

As a theoretical paper, the proposed algorithm is well suited to the problem. However, since one of the main advantages claimed by the authors for policy optimization over an occupancy-measure approach is its efficiency, some showcase experiments would be useful to illustrate the algorithm in practice.

Theoretical Claims

I have checked the theoretical claims concerning the primal regret minimizer algorithm. The dual regret proof seems to follow from existing proofs in the literature such as in Stradi et al. (2024b), so I did not check them in detail.

Experimental Design and Analysis

Not applicable.

Supplementary Material

I have checked the proofs related to the primal regret algorithm.

Relation to Prior Work

This paper proposes the first approach to tackle stochastic and adversarial rewards and constraints in CMDPs with bandit feedback. Most previous work focused on the case where rewards are adversarial and constraints are stochastic. Stradi et al. (2024b) propose an approach for stochastic and adversarial rewards and constraints, but with full-information feedback.

Missing Important References

All relevant references related to CMDPs seem to be discussed. As the paper also claims to develop the first no-interval-regret (a.k.a. adaptive-regret) algorithm for unconstrained MDPs under bandit feedback, it could benefit from a related-work discussion of existing approaches to adaptive regret in online learning and MDPs.

Other Strengths and Weaknesses

Strengths:

  • The paper is mostly well-written.

  • The paper presents the first primal-dual approach using policy optimization to deal with both adversarial and stochastic rewards and constraints with bandit feedback.

Weaknesses: The main weaknesses of this paper appear to be the lack of technical novelties in both the algorithm and its analysis, as outlined below:

  • Policy optimization is a Mirror Descent-like algorithm. Cesa-Bianchi et al. (2012) presents a general method for achieving adaptive regret in Mirror Descent approaches through a fixed-share update. Hence, it seems that applying this technique in this context is not novel.

  • The proof of Theorem 3.3 appears to be essentially the same as that of Theorem 4.1 from Luo et al. (2021), except for the fixed-share step, which is a standard technique for achieving adaptive regret in Mirror Descent approaches.

  • Once the primal and dual algorithms are designed to satisfy the no-interval-regret property, the bound on the Lagrange multiplier (Theorem 3.4) appears to follow in a very similar manner to Theorem 5.1 in Stradi et al. (2024b).

  • The challenge of handling bandit feedback instead of full information appears to be addressed using classical techniques from Online MDPs in the development of Algorithm 3 (see, for example, Luo et al. 2021).

Other Comments or Suggestions

Below, I provide some suggestions to clarify the paper's presentation for readers.

  • The authors could better define, at the beginning of the paper, what they mean by "best-of-both-worlds". Usually it means an algorithm that achieves optimal regret for both stochastic and adversarial rewards, but in this paper the authors deal with stochastic and adversarial constraints.

  • I recommend that the authors define the "no-interval-regret" property earlier in the paper, as it plays a crucial role in the final result and strongly influences the algorithm's design, and the reader may be unfamiliar with this notion. Additionally, this property has been referred to by other names in the literature, such as "adaptive regret," which could be worth mentioning to improve clarity for readers.

  • I would also advise the authors to give a proper mathematical definition of the terms $\Gamma_t$ and $\Xi_t$ from Algorithm 2 to help the reader better understand their meaning in the proposed method.

Author Response

We thank the Reviewer for the effort in evaluating our work.

On the novelty.

We thank the Reviewer for the opportunity to clarify this aspect. In the following, we highlight the contribution of our work in more detail. While our primal-dual scheme follows that of [Stradi et al. 2024] and the primal regret minimizer builds on [Luo et al. 2021], we believe that the contribution of our work is valuable.

Specifically, our analysis and the one in [Luo et al., 2021] differ substantially in two fundamental aspects. First, we show that our algorithm works with a dynamic payoff range, paying only a linear dependence on the maximum range. While such a dependence is standard in online algorithms, it is usually obtained because the payoff range is known a priori; this is not the case in our setting. To avoid a quadratic dependence on the payoff range, we had to suitably modify the learning rate. Moreover, our algorithm guarantees the no-interval-regret property by employing the fixed-share update. While we agree that the fixed-share update has been shown to work in online convex optimization, we believe that extending this result to MDPs with bandit feedback is valuable.
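To make the fixed-share idea concrete, here is a minimal sketch of the kind of mixing step referred to above, written for a tabular policy; the function name, the mixing weight `beta`, and the array layout are illustrative assumptions rather than the paper's exact update.

```python
import numpy as np

def fixed_share_update(pi_new: np.ndarray, beta: float) -> np.ndarray:
    """Illustrative fixed-share step: mix the freshly updated policy at every state
    with the uniform policy. This kind of regularization is the standard route to
    interval (adaptive) regret in the online-learning literature; the exact mixing
    weight used in the paper is not reproduced here.
    pi_new: array of shape (num_states, num_actions), each row a distribution."""
    num_actions = pi_new.shape[1]
    uniform = np.full_like(pi_new, 1.0 / num_actions)
    return (1.0 - beta) * pi_new + beta * uniform
```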

As a second aspect, notice that the whole primal-dual analysis in [Stradi et al. 2024] is designed to work under full feedback. Indeed, the Lagrangian function given to both the primal and the dual in our work captures less information than the one in [Stradi et al. 2024]; namely, the Lagrangian is built only for the traversed path.

Finally, notice that, differently from any prior work on CMDPs, we complement our analysis by studying the regret w.r.t. an optimal solution which satisfies the constraints at each episode. While this baseline is weaker than the first one we propose, it has been widely studied in settings different from ours, since it is the only one that allows for no-regret guarantees when the constraints are adversarial. Specifically, this baseline has been studied in online convex optimization with constraints, thus in single-state environments with full feedback (see, e.g., “Tight Bounds for Online Convex Optimization with Adversarial Constraints”, 2024). Our results match theirs, showing that $\sqrt{T}$ regret and violation are attainable with only bandit feedback and an MDP structure of the environment. This analysis is novel and independent from prior works.

On Slater's condition.

The Reviewer is correct: Condition 2.5 implies Slater's condition. However, the focus of this work is the lack of knowledge of the Slater parameter $\rho$, rather than achieving theoretical guarantees when Slater's condition does not hold (although we also provide sublinear bounds for that case, namely, when Condition 2.5 does not hold). The motivation is that previous works in the literature with bandit feedback require knowledge of $\rho$ in advance and assume $\rho$ to be a constant, ignoring the case where $\rho$ is very small (e.g., [1]).

On policy optimization.

The point raised by the Reviewer is surely interesting. While a more rigorous analysis is required to answer the question precisely, our perspective is that technical challenges may arise when dealing with the biased loss estimator. Finally, notice that policy optimization approaches are generally preferred to occupancy-measure ones, since occupancy-measure-based algorithms require solving a convex program at each episode. We believe this is the main justification for policy optimization.

Finally, we want to thank the Reviewer for the additional comments and suggestions. We will surely take them into account in the final version of the paper.

[1] Castiglioni et al. 2022, “A Unifying Framework for Online Optimization with Long-Term Constraints”.

Reviewer Comment
  • Slater condition: Thank you for clarifying this point. I would just advise the authors to rephrase the sentence "Notice that, in this work, we do not assume that the Slater’s condition holds. Indeed, our algorithm still works when a strictly feasible solution does not exist" on page 4, lines 187-189, column 1, to mention that you study both cases, when this condition holds and when it does not, and that in each case you obtain different bounds (even though this appears later in the paper).

I also thank the authors for clarifying the novelty of their paper, and I agree that complementing the analysis with a different baseline, to overcome the impossibility result for adversarial constraints and the unknown Slater parameter in the original baseline, is interesting. However, I still have some concerns about the novelty of the techniques:

  • Bandit feedback: I understand that bandit feedback means less information in what is observed in the Lagrangian. However, this estimate of the Lagrangian is sent as a loss to the primal algorithm, which uses the same techniques as Luo et al. 2021 to deal with the fact that the feedback is bandit.

  • Learning rate: As I understand it, the problem here is that the maximum of the loss function is unknown because it depends on the Lagrange multipliers, and the Slater parameter is unknown. However, overcoming this problem with changing learning rates seems to be the same idea as in Stradi et al. 2024. Each mirror descent approach is indeed performed in a different space (occupancy measures vs. policy), but the question of bounding the loss function seems independent.

Author Comment

We would like to thank the Reviewer for the additional feedback.

On Slater's condition.

We will surely modify the sentence as suggested by the Reviewer in the final version of the paper.

On the novelty of the techniques.

We believe that similar novelty arguments can be applied to most papers on MDPs and, more broadly, on online learning. Specifically, we agree with the Reviewer that our paper does not introduce “breakthrough” algorithmic techniques; however, combining all these techniques with the required modifications is not a straightforward task.

For instance, the design of a no-interval-regret minimizer for MDPs is interesting and of independent interest. Moreover, we believe that combining the algorithm of [Luo et al. (2021)] with fixed share, while additionally satisfying the requirements needed by our primal algorithm (such as an unknown range of losses and a variable learning rate), is not that trivial.

As concerns the bandit feedback, we agree with the Reviewer as far as the primal regret minimizer is concerned. Nonetheless, notice that the bandit feedback influences the dual regret minimizer too. Indeed, differently from [Stradi et al. 2024], our dual algorithm cannot receive the expected violation attained by playing the selected occupancy measure, but only the actual violation.

We will surely include this discussion in the final version of the paper, and we hope to have properly addressed the Reviewer's concerns.

Review (Rating: 3)

This paper proposes the first best-of-both-worlds algorithm that can solve a CMDP with only bandit feedback and enjoys optimal regret and constraint-violation guarantees in both regimes. Furthermore, the proposed method does not require Slater's condition and employs a policy optimization approach, which is more efficient than occupancy-measure-based methods and amenable to neural network function approximation.

Update after rebuttal

I thank the authors for the rebuttal responses. I think proposing an algorithm for the bandit-feedback setting is a good contribution, as its analysis is often non-trivial and sometimes requires interesting ideas, as explained in the rebuttal to Reviewer fRky.

On the other hand, the obtained bounds do not seem tight, or at least it is unclear whether they are tight in other factors such as the size of the state space. Furthermore, the definition of constraint violation employed in this paper is a weak version that allows violations to cancel. That said, this paper could be a first step towards an algorithm with completely zero violation.

Weighing these aspects, I decided to keep my original score of "Weak Accept".

Questions for the Authors

Please see the first question in the Other Comments Or Suggestions section.

Claims and Evidence

The paper claims that the proposed algorithm is the first best-of-both-worlds algorithm that can solve a CMDP with only bandit feedback and enjoys optimal regret and constraint-violation guarantees in both regimes. As far as I know, it is indeed the first best-of-both-worlds algorithm under the considered setting. All proofs in the appendix seem to be well-written and correct.

Methods and Evaluation Criteria

Regret analysis is a reasonable technique for evaluating an algorithm from a theoretical perspective.

Theoretical Claims

I have roughly read the appendix. All the derivations seem to be well-written and correct.

Experimental Design and Analysis

There are no experiments.

Supplementary Material

I have roughly read the appendix. All the derivations seem to be well-written.

Relation to Prior Work

Since reinforcement learning aims at solving various sequential decision-making problems, deepening the theoretical understanding of RL algorithms in many settings has a broad impact on various fields. Reinforcement learning is also presumed to be implemented in the brain, and it is frequently used to model the decision-making processes of animals. Furthermore, this paper considers the constrained RL setting, which is gaining importance due to the increasing use of RL in real life.

The bandit feedback setting is the most practical setting, and this paper solves an important problem. Furthermore, the analysis that does not require Slater’s condition seems also valuable for other CMDP regret analyses.

Missing Important References

I think the related work in the Appendix covers the important references well.

Other Strengths and Weaknesses

Strengths

  • The paper is well-structured and clearly written. The authors effectively present the algorithmic framework, motivation, and theoretical analysis, making it easy to follow.
  • The problem setting and the best-of-both-worlds contribution are important in the field of constrained MDPs.
  • Compared to prior work that relies on occupancy measures (and full-information feedback), the proposed method is a policy-based approach and amenable to neural-network-based implementations. Furthermore, the proposed method does not require Slater's condition, which many other algorithms need.
  • To deal with the lack of Slater’s condition, the author proposes several techniques to bound the Lagrange multipliers and loss for updating policies, which are of independent interest for other researchers too.

Main Weaknesses and Concerns

  • In the abstract, the authors claim that the regret bound is optimal. However, in CMDP settings, an existing algorithm achieves a $\sqrt{T}$ strong violation regret without assuming Slater's condition (Ghosh et al., 2024), which appears to provide a tighter bound than those stated in Theorems 4.4 and 4.5 of this paper. Could you explain the differences between your results and theirs?

Other Comments or Suggestions

Main Question

Most of the technical challenges and algorithmic contributions in this paper seem to arise from the absence of Slater's condition rather than from the difficulty of bandit feedback. For instance, challenges such as bounding the Lagrange multipliers, the fixed-share update, and the loss-range issue (as described on page 2) primarily arise due to the lack of Slater's condition. While I understand the importance of handling CMDPs without assuming Slater's condition, it is unclear how these challenges are inherently tied to the bandit-feedback setting. Does the combination of no Slater's condition and bandit feedback introduce additional technical difficulties? If not, it would be nice to clarify the contribution regarding the lack of Slater's condition in the title and abstract.

Minor Weaknesses or Suggestions

  • In line 3 of Algorithm 3, the phrase "set of all possible transitions" is somewhat unclear. Initially, I thought of it as a set storing next states, but upon further reading, I found that it represents the confidence set of transition functions. To avoid confusion, I suggest rewording it as "set of all possible transition functions."
  • The term "loss" in the introduction is ambiguous. To improve clarity, consider explicitly specifying it as "loss used to update the policy" or "Lagrangian loss" to help readers distinguish its role in the analysis.

Due to the lack of important information (e.g., limitation, comparison table), I set the score to 3, but I’m happy to increase the score if all the concerns are addressed.

Author Response

We thank the Reviewer for the positive evaluation.

On Ghosh et al., 2024.

We thank the Reviewer for the interesting point. First, we underline that we use the term best-of-both-worlds referring to the constraints definition. Thus, our algorithm achieves optimal regret guarantees both when the constraints are stochastic (and the loss may be adversarial) and when the constraints are adversarial. Notice that our definition of best-of-both-worlds in constrained settings is standard in the literature (e.g., see [1]). (Ghosh et al. 2024) deals with the stochastic reward and constraints setting only, thus their algorithm does not admit adversarial losses as our stochastic setting does. Finally, notice that sublinear results employing the positive regret and violation definitions do not imply that a better rate can be obtained with our metrics; thus, our dependence on the number of episodes is still optimal.

On the contribution.

We thank the Reviewer for the opportunity to clarify this aspect. In this work, we design a primal-dual algorithm that does not require the Slater parameter $\rho$ as input and, in addition, does not need any assumption on $\rho$. Indeed, in the analysis, we study both the case where Slater's condition is verified (Condition 2.5 holds) and the case where it is not (namely, either $\rho=0$ or $\rho$ is too small compared to the horizon). Nonetheless, while the lack of Slater's condition leads to weaker guarantees, most of the difficulties in our analysis arise from the lack of knowledge of the Slater parameter $\rho$, which is considerably harder to tackle when the feedback is bandit. In particular, the bandit feedback affects the design of the primal algorithm and requires additional forms of regularization in the learning dynamics, such as the fixed-share update.

We will surely include a comparison table in the final version of the paper to clarify these aspects.

Since these aspects are crucial, please let us know if further discussion is necessary.

We finally thank the Reviewer for the suggestions and we will surely implement them in the final version of the paper.

[1] Castiglioni et al. 2022, “A Unifying Framework for Online Optimization with Long-Term Constraints”.

Reviewer Comment

Thank you for your response. Your response clarifies the difference from [Ghosh et al. 2024].

“In particular the bandit feedback … like the fixed-share update” Thank you for your response. Clarifying this part in the paper should address my concern.

Additional Minor Suggestion

When the constraint regret allows for error cancellation (your Definition 2.4), I believe it is possible to achieve a tighter, zero constraint-violation regret with a minor modification (e.g., see Appendix H of https://arxiv.org/abs/2206.11889). Since the modification is almost trivial, I don't think the contribution of the paper degrades without it. Yet, mentioning it may help readers who are aiming for zero violation.

Author Comment

We would like to thank the Reviewer for the suggestion. We will surely incorporate it in the final version of the paper.

Review (Rating: 2)

This paper considers policy optimization for the CMDP framework with stochastic and adversarial constraints. Compared to existing works, this paper considers bandit feedback rather than full feedback. The paper uses a primal-dual-based policy optimization method to find the policy directly rather than the state-action occupancy measure; hence, it is more suitable. The paper shows that one can achieve $\mathcal{O}(\sqrt{T})$ regret and violation if Slater's condition holds, or $\mathcal{O}(T^{3/4})$ regret and violation if it does not. The paper is the first to provide a best-of-both-worlds result in the CMDP setting.

Update: After going over the paper and the responses again, I have decided to decrease my score, for the reasons explained below.

The key contributions are the following:

  1. Relaxing the full-information feedback assumption compared to Stradi et al.

  2. Obtaining results even when Slater's condition does not hold (or the strict-feasibility parameter can be very small).

  3. Using a lower-complexity approach that avoids solving a convex optimization problem to find the occupancy measure over all possible transition probability sets.

While I agree that if someone combines all the tools in a novel way and shows that doing so is not trivial, then we should certainly accept the paper, I am not sure all these contributions are very novel. Let me elaborate on my points. If the paper is accepted, I suggest the authors consider the following:

  1. Stradi et al. consider that the transition probabilities are unknown. Hence, I am not sure how much complexity the bandit feedback adds. As is well known, most of the complexity comes from the unknown transition probabilities. In the adversarial setting it may indeed matter; however, I am not sure that the tools are very novel. Perhaps other reviewers can convince me otherwise.

  2. I agree that the second claim is indeed novel; however, as I pointed out, there are other approaches that have already considered this in the stochastic setting. Hence, I am not so sure about the contribution. In particular, we do not know the lower bound, hence the tightness cannot be assessed here.

  3. The third contribution is interesting; however, it directly follows from Luo et al. It is not clear whether extending it to the constrained case is very complex.

Questions for the Authors

  1. Can the authors comment on Weakness 1?

  2. Can the authors point out the main technical novelties compared to Luo et al. '21? In particular, why is combining the results of Luo et al. '21 with the existing understanding from CMDP works not enough to obtain these results?

Claims and Evidence

The claims are clearly supported by the proofs.

Methods and Evaluation Criteria

The paper is mainly theoretical in nature. The paper's claims are supported theoretically.

Theoretical Claims

I briefly went over the proofs. There are no major concerns.

Experimental Design and Analysis

N/A

Supplementary Material

I have briefly gone over the proofs.

Relation to Prior Work

The paper has made a clear connection with the existing works.

Missing Important References

N/A

Other Strengths and Weaknesses

Strengths:

  1. This is the first result that achieves sub-linear regret and violation using bandit feedback. Hence, in terms of results, the paper's contributions are significant.

  2. The paper is relatively well-written, and the proofs are nice and clean.

Weaknesses:

  1. Even though the paper talks about policy optimization, it still relies on the state-action occupancy measure. In particular, the paper relies on computing the maximum state-action occupancy measure among the possible transition probability sets (line 5 in Algorithm 3). The proposed approach also needs to maximize the bonus term over the possible transition probability sets (line 6 in Algorithm 3). The paper does not discuss how easy or difficult it is to solve this problem.

  2. The paper heavily relies on Luo et al. '21. The paper did not explain the technical novelties.

  3. The approach is model-based and thus not applicable to large state spaces (even with linear function approximation).

  4. The paper considers a weaker notion of violation in which violations can cancel each other. In particular, one can have a violation of +1 and then a violation of -1, for a total violation of 0, even though half of the episodes violate the constraints. Can the paper discuss a stronger (cancellation-free) notion of violation? Recent works in CMDPs have addressed this issue.

[A1]. Ghosh, Arnob, Xingyu Zhou, and Ness Shroff. "Towards achieving sub-linear regret and hard constraint violation in model-free rl." In International Conference on Artificial Intelligence and Statistics, pp. 1054-1062. PMLR, 2024.

[A2]. Müller, Adrian, Pragnya Alatur, Volkan Cevher, Giorgia Ramponi, and Niao He. "Truly no-regret learning in constrained mdps." arXiv preprint arXiv:2402.15776 (2024).

Other Comments or Suggestions

N/A.

Author Response

We thank the Reviewer for the positive evaluation.

On W1 and Q1.

As shown in [1], computing the maximum state-action occupancy measure among the possible transition probability sets (Line 5 in Algorithm 3) can be done efficiently via a greedy algorithm. The same reasoning holds for the corresponding minimum and for the bonus quantity (see Luo et al. '21). We finally remark that performing the OMD update over the occupancy-measure space (as in occupancy-based methods) would require solving a convex program at each episode, which is considerably heavier than computing upper occupancy bounds or bonus terms.
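For intuition, the greedy routine referenced above is related to maximizing an expectation over a confidence set of transition probabilities. The sketch below shows one such inner greedy step under an L1-ball assumption, with illustrative names; it is not the paper's (or Jin et al.'s) exact procedure.

```python
import numpy as np

def greedy_max_expectation(p_hat: np.ndarray, eps: float, values: np.ndarray) -> float:
    """Sketch of a greedy inner step behind upper-occupancy-bound / bonus computations:
    maximize sum_s p(s) * values(s) over distributions p with ||p - p_hat||_1 <= eps.
    Names and the confidence-set shape are assumptions, not the paper's implementation."""
    p = p_hat.astype(float)
    order = np.argsort(values)                  # ascending: low-value states first
    best = order[-1]                            # highest-value next state
    add = min(eps / 2.0, 1.0 - p[best])         # extra mass allowed on the best state
    p[best] += add
    remaining = add                             # remove the same mass from low-value states
    for s in order:
        if s == best or remaining <= 0:
            continue
        take = min(p[s], remaining)
        p[s] -= take
        remaining -= take
    return float(p @ values)
```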

On W2 and Q2.

The algorithm from [Luo et al. '21] cannot be employed as our primal regret minimizer as is. In particular, our primal regret minimizer must guarantee the no-interval-regret property (needed to show that the Lagrange multipliers are bounded during the learning dynamics), which requires the introduction of a fixed-share update. To the best of our knowledge, our primal regret minimizer is the first to exhibit the no-interval-regret property for adversarial MDPs with bandit feedback. Furthermore, we had to improve the dependence of the regret bound on the loss range by introducing a dynamic learning rate, which improves the dependence on the payoffs' range from quadratic to linear. Indeed, part of the complexity of designing the primal algorithm is that the range of the losses cannot be known a priori, as it is a Lagrangian loss and therefore proportional to the problem-specific Lagrange multipliers.

On W3.

We agree with the Reviewer that extending the guarantees presented in this paper to model-free approaches, scalable beyond the tabular CMDP setting, would be of great interest. However, the results presented in this paper are novel even for the simplest tabular case, and we believe this work might be a good starting point for extending the results to more challenging scenarios in future works.

On W4.

Guaranteeing sublinear positive violation requires the primal and the dual regret minimizers to attain convergence to the constrained optimum, namely, last-iterate convergence to the equilibrium of the Lagrangian game. To the best of our knowledge, no algorithm can guarantee last-iterate convergence in the setting faced by the primal regret minimizer. Finally, notice that, if future works find no-regret algorithms with last-iterate convergence in the settings faced by the primal and dual regret minimizers, sublinear positive violation (in the stochastic setting) would directly follow from the analysis of our primal-dual procedure. We underline that the papers mentioned by the Reviewer focus on stochastic settings only, where it is possible to exploit confidence intervals to estimate both rewards and constraints. This cannot be done in our setting, since it would prevent any algorithm from dealing with adversarial settings.

[1] Jin et al. 2020, "Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition".

Reviewer Comment

I appreciate the authors' replies. Regarding last-iterate convergence, there are works in the stochastic setting, [A1].

As another comment, when $\rho=0$, how do you bound the regret and the violation? The approach presented in [A2] also does not use Slater's condition.

[A1]. Ding, D., Wei, C. Y., Zhang, K., & Ribeiro, A. (2023). Last-iterate convergent policy gradient primal-dual methods for constrained mdps. Advances in Neural Information Processing Systems, 36, 66138-66200.

[A2]. Ding, Y. and Lavaei, J., 2023, June. Provably efficient primal-dual reinforcement learning for cmdps with non-stationary objectives and constraints. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 37, No. 6, pp. 7396-7404).

Author Comment

We thank the Reviewer for the possibility to clarify these aspects.

Regarding last-iterate convergence, there are works in the stochastic setting, [A1].

We certainly agree that there exist primal-dual procedures which attain last-iterate convergence to the constrained optimum, as in [A1]. Nonetheless, not only does [A1] focus on the stochastic setting only, it also does not focus on regret minimization. Thus, their approach is completely different from ours, since no exploration-exploitation trade-off is involved, and it cannot be easily generalized to our setting.

Moreover, please notice that, in order to achieve last-iterate convergence employing our primal-dual scheme, it is necessary to have access to (primal and dual) regret minimizers which attain last-iterate convergence under bandit feedback. To the best of our knowledge, the only algorithm which attains those guarantees is [1]. Nevertheless, its convergence rate is of order $t^{-1/8}$, which would lead to highly suboptimal regret and violation bounds.

As another comment, when $\rho=0$, how do you bound the regret and the violation? The approach presented in [A2] also does not use Slater's condition.

When $\rho=0$, we bound the regret and the violation by $\widetilde{\mathcal{O}}(T^{3/4})$, which is in line with [A2] when Slater's condition is not enforced (see their Theorem 9). This is done by suitably clipping the Lagrangian space, as in Line 3 of Algorithm 2.

[1] “Uncoupled and Convergent Learning in Two-Player Zero-Sum Markov Games with Bandit Feedback”, Cai et al. (2023)
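To illustrate what a clipped dual step of this kind might look like, here is a generic projected-ascent sketch under assumed names; it is not a reproduction of Line 3 of Algorithm 2 or of its exact threshold.

```python
import numpy as np

def dual_update(lam: np.ndarray, violation: np.ndarray, eta: float, cap: float) -> np.ndarray:
    """Illustrative dual step: gradient ascent on the Lagrange multipliers followed by
    projection onto a clipped multiplier space [0, cap]. Here `cap` stands in for a
    T-dependent threshold used when the Slater parameter is unknown or zero; the exact
    value used in the paper is not reproduced here."""
    lam_next = lam + eta * violation    # ascent step driven by the observed violation
    return np.clip(lam_next, 0.0, cap)  # keep each multiplier in the clipped space
```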

Final Decision

This paper studies constrained MDPs with rewards and constraints that can be either stochastic or adversarially selected. The authors provide best-of-both-worlds results with respect to the constraints (no logarithmic bound for the stochastic regret).

Its main contributions are

  1. Relaxing the full-information assumption on rewards/constraints compared to prior work
  2. Obtaining results even when Slater's condition does not hold (or the Slater parameter can be very small).
  3. Using a lower-complexity approach that avoids solving a convex optimization problem to find the occupancy measure over all possible transition probability sets.

None of these in isolation is novel. Nonetheless, combining all parts seems non-trivial.

All reviewers who engaged in the discussion agree that the proofs are sound, but the technical novelty is somewhat limited.