Taming Adversarial Constraints in CMDPs
Abstract
Reviews and Discussion
The paper studies RL for constrained MDPs where both the rewards and the constraints are adversarial. The paper bounds the regret and the (positive) constraint violation in terms of the corruption budget C. There is related corruption-robust work for unconstrained MDPs, but the setting is largely unexplored for constrained MDPs. The paper first presents an optimistic algorithm for the case where the corruption budget is known in advance. Then, the authors introduce a meta-algorithm, reminiscent of the corralling bandit algorithm, that runs a meta bandit algorithm over multiple optimistic algorithms tuned for different corruption budgets.
Strengths and Weaknesses
Strengths:
The problem setup and result are novel. The presentation of their algorithm is clear and mostly easy to follow.
Weaknesses:
The algorithm seems to be similar to the one introduced by Jin et al. (2024). Highlighting the difference in the main paper seems necessary.
Questions
Q1) The corralling bandit reduction idea in the corruption setting seems to have also been used in Jin et al. 2024 (Algorithm 3 in their paper uses a similar FTRL). Although Jin et al. study a different setting (adversarial loss + corrupted transitions), I think it would be good to discuss their algorithm and highlight the novelty of your algorithm in comparison to theirs in detail. What is the main difference from their algorithm?
Q2) The paper studies the corrupted reward + corrupted cost setting. Have you considered extending to corrupted reward + corrupted cost + corrupted transitions?
Limitations
Yes.
Justification for Final Rating
The authors have addressed my questions. I am maintaining my positive score.
Formatting Issues
None.
We thank the Reviewer for the positive evaluation of our work.
The corralling bandit reduction idea in the corruption setting seems to have also been used in Jin et al. 2024 (Algorithm 3 in their paper uses a similar FTRL). Although Jin et al. study a different setting (adversarial loss + corrupted transitions), I think it would be good to discuss their algorithm and highlight the novelty of your algorithm in comparison to theirs in detail. What is the main difference from their algorithm?
We thank the Reviewer for the opportunity to clarify this aspect. Indeed, our algorithm for the unknown setting is inspired by corralling techniques such as the one of [Jin et al. 2024]. However, these meta-procedures are specifically designed to deal with unconstrained problems, namely, their goal is single-objective: minimizing the regret. In our setting, minimizing the regret is not the only objective, since we additionally want to keep the violations as small as possible. Thus, it is not possible to blindly apply Corral-like techniques, as doing so would lead to arbitrarily large violations.
In our work, we propose a Lagrangian approach by building the meta-algorithm's loss as a positive Lagrangian function, which, to the best of our knowledge, is a novel tool in the online CMDPs literature. Furthermore, properly balancing the regret and the positive constraint violation requires a careful calibration of the Lagrangian multiplier. On the one hand, the Lagrangian multiplier must be large enough to guarantee that large constraint violations cannot be compensated for by a sufficiently negative regret; on the other hand, since the FTRL regret bound depends on the range of the Lagrangian loss, the Lagrangian multiplier must be small enough not to prevent FTRL from being no-regret (see Section F.1 for additional details on these aspects). Moreover, the rest of the analysis presents various challenges derived from the constrained nature of the problem and the definition of positive constraint violation (e.g., Section F.3). Thus, while inspired by Corral for what concerns the stability results, our analysis is novel and independent from existing works in the field.
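For concreteness, a minimal sketch of a positive-Lagrangian meta-loss fed to an FTRL-style master over the base learners could look as follows; the scalar constraint with a single threshold, the fixed multiplier `lam`, and the `play_episode()` interface of the base learners are illustrative assumptions, not the exact construction analyzed in the paper.

```python
import numpy as np

def positive_lagrangian_loss(reward, constraint_value, threshold, lam):
    """Illustrative meta-loss: a regret term plus a positive (clipped) violation term.

    The max(., 0) makes the second term vanish for policies that satisfy the
    constraint, so a very negative regret cannot be used to offset violations.
    """
    return -reward + lam * max(threshold - constraint_value, 0.0)

def ftrl_weights(cum_losses, eta):
    """Exponential-weights / FTRL-style distribution over base learners."""
    scores = -eta * cum_losses
    scores = scores - scores.max()                       # numerical stability
    w = np.exp(scores)
    return w / w.sum()

def meta_loop(base_learners, T, threshold, lam, eta):
    """Sketch of the meta-loop, with bandit feedback on the selected learner only."""
    n = len(base_learners)
    cum_losses = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(T):
        w = ftrl_weights(cum_losses, eta)
        i = rng.choice(n, p=w)                           # pick one base learner
        reward, value = base_learners[i].play_episode()  # assumed interface
        loss = positive_lagrangian_loss(reward, value, threshold, lam)
        cum_losses[i] += loss / w[i]                     # importance-weighted estimate
    return ftrl_weights(cum_losses, eta)
```

As discussed above, `lam` must be large enough that violations cannot be traded for negative regret, yet small enough that the loss range does not inflate the FTRL regret bound.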
We will surely include this discussion in the final version of the paper.
The paper studies the corrupted reward + corrupted cost setting. Have you considered extending to corrupted reward + corrupted cost + corrupted transitions?
We agree with the Reviewer that extending our results to the corrupted-transitions case would be of great interest. Nonetheless, the aim of this work was mainly to ease the impossibility result on learning with adversarial constraints. Thus, we did not consider the extension to corrupted transitions, even though we regard it as an interesting future research direction.
Thank you for the explanation. I will take it into account when making the final recommendation.
The author studies a constrained Markov decision process (CMDP) setting in which both the rewards and constraints are sampled from probability distributions that may change adversarially across episodes. This setting differs from the standard CMDP framework, which usually assumes that these distributions are fixed or stationary. To address this more general and challenging setting, the author proposes an approach based on occupancy measures and provides theoretical performance guarantees. Specifically, the paper considers two distinct cases: one where the nature of the adversarial corruption is known, and one where it is unknown. For each case, a dedicated algorithm is developed to ensure valid and efficient decision-making under uncertainty.
Strengths and Weaknesses
Strengths. The paper is well-structured, and the theoretical analysis is solid and convincing. This work addresses a challenging setting in which both the rewards and constraints are sampled from probability distributions that may change adversarially across episodes. Such a setting is difficult to handle and is not supported by many existing approaches in the literature.
Weaknesses. Although the theoretical guarantees are strong, the paper would benefit from empirical results. Including experiments would help illustrate the practical performance of the proposed algorithms and provide a more direct understanding of their behavior in realistic settings.
Questions
- While it is common to use occupancy measures in discrete settings, they can also be applied in continuous state or action spaces. Can the proposed framework be extended to handle continuous CMDPs? If so, what are the main challenges or limitations in doing so?
- For CMDP problems, a practical way to evaluate the performance of an algorithm is to examine the level of constraint violation in the empirical results. In particular, it is useful to study the trade-off between the objective value and the degree of constraint violation in out-of-sample testing. Can the author provide experimental results to demonstrate this trade-off and compare the performance of the proposed method against relevant baselines?
Limitations
Yes
Justification for Final Rating
Keep score
Formatting Issues
No
We thank the Reviewer for the positive evaluation of our work.
While it is common to use occupancy measures in discrete settings, they can also be applied in continuous state or action spaces. Can the proposed framework be extended to handle continuous CMDPs? If so, what are the main challenges or limitations in doing so?
We thank the Reviewer for the interesting question. We believe that, in order to extend our results to continuous settings, it is sufficient to build a learning algorithm tailored to continuous state-action spaces for the case where the corruption budget is known. Then, to generalize the results to the unknown setting, we believe that the approach employed in Lag-FTRL still applies, with some minor changes.
For CMDP problems, a practical way to evaluate the performance of an algorithm is to examine the level of constraint violation in the empirical results. In particular, it is useful to study the trade-off between the objective value and the degree of constraint violation in out-of-sample testing. Can the author provide experimental results to demonstrate this trade-off and compare the performance of the proposed method against relevant baselines?
We agree with the Reviewer that experiments are always beneficial; nonetheless, we underline that in the online CMDPs literature, many works do not provide experimental results (e.g., [1], [2], [3], [4]).
[1] "Exploration-exploitation in constrained mdps", Efroni et al. (2020)
[2] "Provably Efficient Safe Exploration via Primal-Dual Policy Optimization", Ding et al. (2020)
[3] "Upper confidence primal-dual reinforcement learning for cmdp with adversarial loss", Qiu et al. (2020)
[4] "Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs", Liu et al. (2021)
Thank you for your clarifications. I am satisfied with the responses and will maintain my original score.
This paper studies a CMDP where both rewards and constraints can face adversarial corruptions, whose cumulative magnitudes (for rewards and constraints, respectively) define the corruption budget. Under the unknown-transition, bandit-feedback model, an occupancy-measure-based algorithm is proposed, with regret and positive constraint violation bounds that scale with the corruption budget. When no corruptions are present, i.e., in a classical stochastic CMDP instance, the bound recovers the stochastic rate; in the worst case, when rewards and constraints are completely adversarial, the bound agrees with the lower bound.
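As background, the occupancy-measure viewpoint (standard in this line of work) represents policies through expected state-action visit frequencies, so that rewards and constraints become linear objectives; the loop-free layered notation below is a generic illustration, not necessarily the paper's exact formulation.

```latex
% q(s,a): expected number of visits to (s,a) in an episode.
% For a loop-free episodic MDP with layers X_0, \dots, X_L and transitions P,
% valid occupancy measures form a polytope:
\sum_{s \in X_k} \sum_{a} q(s,a) = 1 \quad \forall k,
\qquad
\sum_{a} q(s,a) = \sum_{s', a'} P(s \mid s', a')\, q(s', a') \quad \forall s \notin X_0.
% Expected rewards and constraint values are linear in q, e.g. \sum_{s,a} q(s,a)\, r_t(s,a),
% which is what makes optimistic and FTRL-style updates over q tractable.
```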
Strengths and Weaknesses
Strengths:
- The paper is overall well-written, with a clear flow on related works, main technical challenges, and the two steps towards the regret guarantee.
- The corruption-dependent rate is tight (up to logs), which sets it apart from the dependency typically seen in classical bandits/MDPs and highlights the difficulty of adversarialness in CMDPs, as pointed out by the lower bound.
- The technical contribution regarding the extension of CORRAL is non-trivial -- while the online learning loss / updates in Eqs. (5) and (6) do not look particularly surprising on their own, according to the authors there are no suitable CORRAL variants that work for primal-dual-style problems. But I'm not fully certain that I understand this part correctly; please refer to Weakness 2.
Weaknesses:
- For the known-C case, the algorithm is a more-or-less standard adaptation of a stochastic algorithm to its corrupted counterpart. Nevertheless, given the extension to the unknown-C case, I understand that Sec 3 is more of a building block than a main contribution.
- The authors mentioned that Lag-FTRL can be of independent interest; however, the current algorithmic form and description lack sufficient explanations to allow a seamless extension. It would be much better if the authors could write something like "suppose that we'd like to solve a primal-dual problem like xxx with primal objective aaa and dual objective bbb, then instead of standard CORRAL we can use a Lagrangian-based variant of CORRAL that does ... s.t. aaa and bbb are optimized".
- The transitions must be stationary (but I understand that it would be substantially harder for occupancy-measure-based algorithms). Nevertheless, it would be very interesting to incorporate the non-stationary(-transition) MDP literature.
Minor: Lines 68--73 don't read too differently from CORRAL / its variants; it would be better to move Lines 100--104 earlier.
But overall, given the clearly-written draft, the strong regret guarantee, and the non-trivial technical contribution, I would love to recommend an acceptance.
Questions
See Weaknesses 1 -- 3.
Additional question: Does it matter to change the baseline to the policies attaining the argmin in the definitions of the regret and the violation? I think no?
Also in Lag-FTRL -- why don't we need to dynamically maintain a Lagrangian during the algorithm, as a standard primal-dual-based online learning algorithm does? That is, why would an upper bound on the Lagrangian suffice?
Limitations
Not really. But aside from Weakness 3, which is a minor one, I do not see many obvious limitations.
Justification for Final Rating
I am satisfied with the authors' responses and decide to keep my original evaluation.
Formatting Issues
No
We thank the Reviewer for the positive evaluation of our work.
The authors mentioned that Lag-FTRL can be of independent interest; however, the current algorithmic form and description lack sufficient explanations to allow a seamless extension. It would be much better if the authors could write something like "suppose that we'd like to solve a primal-dual problem like xxx with primal objective aaa and dual objective bbb, then instead of standard CORRAL we can use a Lagrangian-based variant of CORRAL that does ... s.t. aaa and bbb are optimized".
We thank the Reviewer for pointing out this aspect. We state that our meta-algorithm is of independent interest since it can be applied to many constrained online learning scenarios. In these settings, all the other corralling techniques fail, since they are tailored for single-objective scenarios. For instance, if future work manages to extend NS-SOPS (with the additional stabilization part) to the linear constrained MDPs setting (see [1] for the definition in unconstrained MDPs, [2] for the constrained one), the analysis for the unknown setting would follow by employing Lag-FTRL with only minor changes.
We will better clarify it in the final version of the paper.
The transitions must be stationary (but I understand that it would be substantially harder for occupancy-measure-based algorithms). Nevertheless, it would be very interesting to incorporate the non-stationary(-transition) MDP literature.
We thank the Reviewer for the suggestion. We will surely include more details on the MDPs with corrupted transitions literature in the final version of the paper.
On the change of the baseline.
The Reviewer is correct, our results still hold for .
Also in Lag-FTRL -- why don't we need to dynamically maintain a Lagrangian during the algorithm, as a standard primal-dual-based online learning algorithm does? That is, why would an upper bound on the Lagrangian suffice?
We thank the Reviewer for the interesting question. From a high-level perspective, the reason is twofold. First, Lag-FTRL does not directly learn how to satisfy the constraints. Indeed, it is in charge of selecting the optimal subroutine, which, in turn, manages the constraint satisfaction. Second, the loss given to Lag-FTRL is not a simple Lagrangian function, as it is in standard primal-dual methods. Instead, it is a positive Lagrangian; thus, when safe policies are selected, the second part of the Lagrangian goes to zero, thanks to the positive-part operator, and there is no need to reduce the Lagrangian value.
[1] "Online Learning in MDPs with Linear Function Approximation and Bandit Feedback", Neu et al. (2021)
[2] "Sample Complexity Bounds for Linear Constrained MDPs with a Generative Model", Liu et al. (2025)
Thank you for the responses. I encourage the authors to incorporate part of their responses into the revision.
Constrained MDPs have traditionally been studied under two separate settings: with either stochastic or adversarial rewards and constraints. This paper designs algorithms with bandit feedback that can handle both, allowing rewards and constraints to change non-stationarily across episodes. The degree of adversariality is characterized by what the authors call the adversarial corruption C (essentially, the L1 deviation of the expected rewards and constraints). The first part of the paper introduces an optimism-based algorithm achieving the corruption-dependent regret bound when C is known. The second part develops an FTRL-based meta-algorithm that achieves the same regret guarantees even when C is unknown, by running multiple instances of the optimistic algorithm for different candidate values of C.
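For intuition, the instantiation pattern described above might look roughly as follows; the class name, the power-of-two grid of candidate budgets, and the parameters are illustrative assumptions rather than the paper's exact construction.

```python
import math

class OptimisticBaseLearner:
    """Placeholder for the known-C optimistic algorithm, tuned for a guess C_hat."""
    def __init__(self, C_hat):
        self.C_hat = C_hat  # corruption budget this instance is calibrated for

def build_base_learners(T):
    # One instance per exponential scale of the corruption budget: 1, 2, 4, ..., ~T.
    guesses = [2 ** j for j in range(math.ceil(math.log2(T)) + 1)]
    return [OptimisticBaseLearner(c) for c in guesses]

learners = build_base_learners(T=10_000)
print(len(learners), "base learners for candidate budgets", [l.C_hat for l in learners])
# The FTRL-based meta-algorithm then maintains a distribution over these ~log T
# instances, so that the instance calibrated close to the true corruption prevails.
```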
Strengths and Weaknesses
The paper is well written overall, and I appreciate how the authors point out relevant connections to existing work. Conceptually, this paper shows that one can handle positive constraint violations (that is, violations cannot cancel each other out across episodes), and that one can interpolate between stochastic and adversarial setups with a regret bound analogous to previous works on corruption-robust online learning, which is of interest to the community.
The proof techniques build significantly on previous works, especially Jin et al. 2024. The proof of the first result, when the corruption level C is known, seems fairly standard (handling the corruption level is mainly done by constructing appropriate confidence intervals). The main technical contribution seems to be the second result, when C is unknown. The high-level procedure is quite simple: instantiate several (roughly log T) algorithms for exponential scales of C and run some FTRL procedure over them; however, the proof requires bounding the size of the dual variable to enforce constraints and using a stabilizing technique from Jin et al. 2024. Overall the novelty seems sufficient.
One area that could be strengthened is the motivation. The authors give concrete examples in the introduction for constrained MDPs, which are relatively well established, but could spend some time motivating adversarial rewards and constraints within those examples.
The benchmark should also be further motivated, since it is not standard and not fully consistent with the algorithms. Specifically, the total violation counts each per-episode violation positively, but the benchmark for the regret is a policy that is only required to satisfy these constraints in a non-positive way (that is, cancellations across episodes are allowed). This is of course a stronger benchmark than requiring the constraints to hold at every episode, but it would seem natural to impose the latter (or, similarly to agnostic-type bounds, to count the violation of the benchmark in a positive way) and then perhaps achieve stronger bounds (if possible). In that case, is it possible to achieve better bounds? Does the linear-in-T regret lower bound still hold in a fully adversarial case for this milder benchmark?
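For concreteness, the two notions being contrasted can be written generically as follows; the symbols $f_t$, $g_t$, $\alpha$, and $\pi^{\ast}$ are illustrative, and the paper's notation may differ.

```latex
% Positive (non-cancelling) cumulative violation of the learner,
% for a single constraint with threshold \alpha:
V_T = \sum_{t=1}^{T} \big[ \alpha - g_t(\pi_t) \big]^{+},
\qquad [x]^{+} := \max\{x, 0\}.

% Benchmark: best policy subject to the constraint holding only on aggregate,
% so that per-episode violations may cancel out:
\pi^{\ast} \in \operatorname*{arg\,max}_{\pi} \sum_{t=1}^{T} f_t(\pi)
\quad \text{s.t.} \quad \sum_{t=1}^{T} g_t(\pi) \;\ge\; T \alpha.
```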
Last, the paper would be strengthened by discussing further potential lower bounds. In particular, is this order of regret and violation necessary? Are there any known lower bounds (e.g., via reductions from corruption-robust online learning)? What about the dependencies on the remaining problem parameters, are these tight?
Questions
Some questions above.
- Is there any tradeoff between regret and violation? In particular, if we do not have lower bounds, can we at least show that Algorithms 2 and 3 are Pareto-optimal?
- In the introduction, I wasn't sure how the second application about recommender systems (ensuring users are not exposed to offensive content) fits into this framework -- how are the constraints non-stationary?
Minor comments:
p3 Section 2. The length of each episode should be defined formally.
p4 l162. This quantity was only defined in the footnote; it would be preferable to define it formally in the main body.
p8 l326. At this point, this notation had not yet been introduced in the text.
Limitations
Limitations discussed appropriately.
Justification for Final Rating
I am satisfied with the authors' responses and decide to keep my original evaluation.
Formatting Issues
NA
We thank the Reviewer for the positive evaluation of our work.
One area that could be strengthened is the motivation. The authors give concrete examples in the introduction for constrained MDPs, which are relatively well established, but could spend some time motivating adversarial rewards and constraints within those examples. In the introduction, I wasn't sure how the second application about recommender systems (ensuring users are not exposed to offensive content) fits into this framework -- how are the constraints non-stationary?
We thank the Reviewer for the opportunity to further elaborate on this point. Adversarial constraints arise in all the settings in which the environment encompasses other agents. For instance, in a repeated auction with budget constraints, the budget consumption at a given round depends on the bids of all the other agents participating in the auction. Thus, in such a setting, the constraint violation is adversarially selected. Furthermore, imagine a setting where the rewards and the constraints are somehow dependent and market-related. Rewards could model the probability that more than one item is purchased, since the algorithm deals with a multi-state environment, while the constraints could model the ROI a certain company aims to achieve. Since the market is stochastic but highly non-stationary, it is reasonable and efficient to model the problem with adversarial rewards and adversarial constraints.
As concerns the recommender system application, the non-stationarity of the constraints may be due to the fact that the way in which users perceive recommendations changes over time as an effect of changing environmental conditions (e.g., at the geopolitical level). Thus, what is perceived today as "offensive" by a user may be acceptable in the future.
We will surely include this discussion in the final version of the paper.
Is it possible to achieve better bounds? Does the linear-in-T regret lower bound still hold in a fully adversarial case for this milder benchmark?
We thank the Reviewer for the interesting point. The impossibility result does not hold when the benchmark satisfies the constraints at each episode; thus, in this case, it is possible to achieve sublinear regret and sublinear violation. Similar results exist, for instance, in the online convex optimization literature (e.g., see [1]). Nonetheless, we remark the following. First, the weaker benchmark can be arbitrarily worse than the one employed in our work. Second, the aim of this work is mainly to ease the impossibility result of [Mannor et al., 2009]; thus, focusing on the stronger benchmark was the way to go. To conclude, we believe that both baselines are of interest and deserve to be studied, but the results in the different settings are not easily comparable.
On the tightness of the bounds.
We are not aware of a formal lower bound on the dependence on the corruption. However, in the corruption-robust literature, an upper bound with this order of dependence on the corruption is commonly accepted as a state-of-the-art result.
Indeed, while the problem of studying corrupted CMDPs is fairly unexplored, we can draw a parallel between our work and the literature on corruption-robust unconstrained MDPs with corrupted transitions, where a similar impossibility result holds when both the losses and the transitions are adversarial. To the best of our knowledge, this dependence on the corruption is considered state of the art for MDPs with corrupted transitions and for other similar settings (see [2], [3], [4], [5]), and it is commonly accepted as an optimal result. As concerns corrupted CMDPs, to the best of our knowledge, there are no works that achieve even a multiplicative dependence on the corruption.
Finally, given the lower bound in online stochastic unconstrained MDPs which is of the order (see, [6]), we can say that NS-SOPS is tight in , while it has a linear dependence in . Lag-FTRL is tight in , while it has a linear dependence in and a quadratic dependence in .
We will surely include this discussion in the final version of the paper.
[1] "Optimal Algorithms for Online Convex Optimization with Adversarial Constraints", Sinha et al. (2024)
[2] "No regret online reinforcement learning with adversarial losses and transitions", Jin et al. (2024)
[3] "Stochastic linear bandits robust to adversarial attacks", Bogunovic at al. (2021)
[4] "Achieving near instance optimality and minimax-optimality in stochastic and adversarial linear bandits simultaneously", Lee et al. (2021)
[5] "Nearly Optimal Algorithms for Linear Contextual Bandits with Adversarial Corruptions", He et al. (2022)
[6] "Is q-learning provably efficient?", Jin et al. (2018)
Thank you for the responses. Adding some of this discussion (motivation + benchmark) in a revised version could be useful.
This paper studies an intermediate regime in CMDPs, where the rewards and constraints are generally stochastic but can be corrupted up to a corruption budget. This allows improving on prior impossibility results when the corruption is sublinear.
With a known corruption budget C, the algorithm is fairly standard. The authors match the bound in the unknown-C regime, which required some non-trivial work.
All reviewers agree that the paper contains solid contributions, though it could be clarified to what extent they introduce new technical novelties over prior work.