Achieving $\tilde{O}(1/\epsilon)$ Sample Complexity for Constrained Markov Decision Process
Abstract
Reviews and Discussion
The paper proposes a new algorithm that solves the CMDP problem with $\tilde{O}(1/\epsilon)$ sample complexity, which improves the best-known sample complexity in the literature. To achieve this, the paper makes three contributions: (1) new characterizations of the problem instance hardness for CMDP problems; (2) a new algorithm based on the LP literature instead of the traditional RL literature; (3) extended results that also apply to the online LP literature.
Strengths
The proposed method is new. Though it is mainly inspired by existing LP literature, to my understanding similar approaches have not been applied in RL or ML research. The LP formulation also leads to a new perspective on understanding the CMDP problem.
Weaknesses
- The presentation of this work is relatively poor. The authors list three contributions, but I find it hard to understand how these contributions fit together. For example, for Contribution 1, the authors should explain (a) what the problem instance hardness for CMDP problems is, (b) what the motivation is for proposing new characterizations of the problem instance hardness for CMDP problems, and (c) why the new characterization is helpful in developing the new algorithm or achieving better complexity.
- There is insufficient discussion of Lemma 2.1. How is it connected to the whole section "Characterization of Instance Hardness"?
- The reward, cost, and transitions are all estimated during training. I cannot agree that this algorithm works with unknown transition probabilities, since it requires estimating them.
- I doubt whether the theoretical analysis is correct, since it is known that the sample complexity lower bound is $\Omega(1/\epsilon^2)$. See [R1] and [R2].
[R1] Azar, Mohammad Gheshlaghi, et al. "Reinforcement learning with a near optimal rate of convergence." (2011).
[R2] Vaswani, Sharan, Lin Yang, and Csaba Szepesvári. "Near-optimal sample complexity bounds for constrained MDPs." Advances in Neural Information Processing Systems 35 (2022): 3110-3122.
Questions
I am mainly concerned about the complexity. Can the author clarify the difference between this work and classical RL literature?
Limitations
This is a theoretical work, so there is no negative societal impact.
We appreciate your insightful comments, which allow us to provide further clarifications. Please find below our responses to the weaknesses and questions. We hope that our response clarifies your concerns about the paper, and we are happy to provide further clarifications if needed.
Thank you so much for the comment! Please allow us to further clarify. When deriving instance-dependent learning guarantees, it is important to define a measure that describes how difficult it is to separate the optimal policies from the sub-optimal ones. The importance of such a measure has been illustrated in other problems, such as multi-armed bandit problems (e.g., Lai and Robbins (1985)) and reinforcement learning problems (e.g., Auer et al. (2008)). There are also works studying how to characterize such a measure for instance-dependent learning in general sequential decision-making problems, for example Wagenmaker and Foster (2023). This measure is usually defined as the gap (a positive constant) between the value of the optimal policies and the value of the best sub-optimal policies. However, since the optimal policies for CMDP problems are randomized policies, the sub-optimal policies can be arbitrarily close to the optimal ones. In our paper, we show that if we restrict attention to the policies represented by the corner points of the LP, then such a gap can be characterized as the difference between the optimal corner points and the sub-optimal corner points. If this gap is $\Delta$, then on the order of $1/\Delta^2$ samples are required to identify the optimal corner point.
In summary: (a) ``problem instance hardness'' is a measure of the number of samples needed to separate the optimal policies from the sub-optimal ones; (b) the optimal policies for CMDPs are randomized, so previous measures do not apply, and we need a new measure that separates the optimal randomized policies from the sub-optimal ones (by restricting to the policies represented by the corner points of the LP); (c) our new algorithm is developed based on this corner-point characterization. Specifically, Algorithm 1 identifies one optimal corner point (one optimal basis of the LP), and Algorithm 2 resolves the LP sticking to the identified corner point to learn the optimal randomization. Identifying the optimal corner point (basis) and resolving the LP sticking to this corner point (basis) are the key elements of our algorithm, and both are motivated by the instance hardness characterization via corner points.
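Schematically (with notation introduced here only for illustration; the exact definitions in the paper may differ), the gap and the number of samples needed to identify an optimal corner point can be written as
$$
\Delta \;=\; \min_{x \in \mathcal{V}\setminus\mathcal{V}^*}\left( c^\top x^* - c^\top x \right), \qquad N_{\mathrm{identify}} \;=\; \tilde{O}\!\left(\frac{1}{\Delta^2}\right),
$$
where $\mathcal{V}$ denotes the set of corner points (basic feasible solutions) of the LP, $\mathcal{V}^*$ the set of optimal corner points, $x^*\in\mathcal{V}^*$, and $c$ the LP objective.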
Thank you! Lemma 2.1 shows that for any LP there exists an optimal basis, or corner point, and that this basis/corner point can be characterized via its non-zero variables and binding constraints. As discussed in our response to your previous point, this corner-point representation motivates our entire approach.
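For reference, the standard LP fact behind this statement (written here in a generic standard form; Lemma 2.1 may state it differently): for the LP
$$
\max_{x\geq 0}\; c^\top x \quad \text{s.t.} \quad Ax = b, \qquad A\in\mathbb{R}^{m\times n},\ \operatorname{rank}(A)=m,
$$
whenever an optimal solution exists, there is an optimal basic feasible solution, i.e., an index set $B\subseteq\{1,\dots,n\}$ with $|B|=m$ such that $x^*_B = A_B^{-1} b \geq 0$ and $x^*_j = 0$ for $j\notin B$. The corner point is thus pinned down by its non-zero variables and binding constraints.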
Thanks for the comment. Our work assumes that the transition probabilities are unknown and need to be estimated from data. Our algorithm is model-based, which is why we construct estimates of the transition probabilities, as well as of the rewards and costs, during the execution of the algorithm.
Thank you for mentioning the two important papers! It is indeed true that if we seek a worst-case sample complexity bound, then $O(1/\epsilon^2)$ is the best we can hope for, as illustrated in the two papers you mention. However, we would like to emphasize that we derive an instance-dependent sample complexity in our paper. This is the key reason why we can break the lower bound and obtain an improved sample complexity. To be specific, for a problem instance $\mathcal{I}$, denote by $N(\mathcal{I}, \epsilon)$ the number of samples needed to construct an $\epsilon$-optimal policy. The worst-case lower bound states that $\sup_{\mathcal{I}} N(\mathcal{I}, \epsilon) = \Omega(1/\epsilon^2)$. However, if we do not consider the worst-case guarantee, i.e., if we do not maximize over the problem instance $\mathcal{I}$, then we can characterize an instance-dependent constant $C(\mathcal{I})$ (independent of $\epsilon$) such that $N(\mathcal{I}, \epsilon) = \tilde{O}(C(\mathcal{I})/\epsilon)$. The main contributions of our paper are to (i) find a corner-point characterization of the instance-dependent constant $C(\mathcal{I})$, and (ii) derive a policy that achieves the instance-dependent sample complexity bound. Please note that our bound does not contradict the worst-case lower bound $\Omega(1/\epsilon^2)$. We are simply seeking an instance-dependent bound, and when the problem instance is favorable, in the sense that the constant $C(\mathcal{I})$ is smaller than $1/\epsilon$, our bound strictly improves upon the worst-case bound.
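In display form, the two regimes are
$$
\sup_{\mathcal{I}'} N(\mathcal{I}', \epsilon) \;=\; \Omega\!\left(\frac{1}{\epsilon^{2}}\right) \ \ \text{(worst case)} \qquad \text{vs.} \qquad N(\mathcal{I}, \epsilon) \;=\; \tilde{O}\!\left(\frac{C(\mathcal{I})}{\epsilon}\right) \ \ \text{(fixed instance } \mathcal{I}\text{)},
$$
and the instance-dependent bound is the smaller of the two whenever $\epsilon \lesssim 1/C(\mathcal{I})$.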
Thank you for this question! The main difference between our work and the classical RL-for-CMDP literature is that we derive an instance-dependent sample complexity bound, whereas the previous literature focuses on worst-case sample complexity bounds. Instance-dependent guarantees have been considered in the previous RL literature (but without constraints); see for example [1] and [2] for logarithmic regret bounds, which translate into sample complexity bounds. However, these works do not consider constraints.
[1]. Velegkas, Grigoris, Zhuoran Yang, and Amin Karbasi. "Reinforcement learning with logarithmic regret and policy switches." Advances in Neural Information Processing Systems 35 (2022).
[2]. He, Jiafan, Dongruo Zhou, and Quanquan Gu. "Logarithmic regret for reinforcement learning with linear function approximation." International Conference on Machine Learning. PMLR, 2021.
Thanks for the clarification! I did misunderstand the contribution of this paper; it is much clearer now. I will increase my score from 3 to 5.
Thank you so much!
The paper studies the linear program formulation of constrained MDPs. The paper first characterizes the instance hardness of the underlying LP. Then, by proposing an algorithm that operates in the primal space and resolves the primal LP in an online manner, the paper derives an overall sample complexity of $O(1/\epsilon)$ up to logarithmic terms.
Strengths
- The paper provides a strong bound for constrained MDPs using LP-based approaches. The obtained sample complexity is the first in the literature to achieve $\tilde{O}(1/\epsilon)$.
- The derivation and presentation of theoretical results are clear and easy to follow. The proposed theoretical framework has the potential to be applied to more general online LP problems.
Weaknesses
- The introduction and related work parts of the paper are not comprehensive enough (especially the related work). At the beginning of Section 1.2, the paper compares itself with [25, 29, 44, 14]. Yet all of these works focus on policy-based methods, and I think they are not directly comparable with the method proposed in this paper. For example, the dependency of the sample complexity in this work is $|S|^4$, which is usually lower for policy-based methods. Moreover, I think the paper might miss several related works that also study occupancy-measure-based approaches, for example, "Achieving Zero Constraint Violation for Concave Utility Constrained Reinforcement Learning via Primal-Dual Approach" by Bai et al.
- The paper does not provide numerical simulations. Some simulations would help the reader get a better sense of how the convergence rate of the algorithm depends on the problem size.
Questions
- Can the authors provide a more comprehensive literature review and problem introduction? I also think it is important to consider the dependency of the sample complexity on problem-related parameters (i.e., $|S|$ and $|A|$) in the comparison.
- It would be better to briefly discuss the proof ideas/implications of the theoretical results around the theorems. Currently, the ending of the paper seems too abrupt.
- If possible, can the authors provide some preliminary numerical experiments (not mandatory)?
Limitations
I did not find a place where the paper explicitly discusses the limitations (yet the authors answer "Yes" to checklist question 2). It also seems that the authors forgot to provide justifications for the other checklist questions.
We would like to thank you for your insightful review! Please find below our responses to the weaknesses and your questions.
Thank you for mentioning the important references! We will refine our literature review to incorporate a better comparison with existing works and approaches, including the important work you pointed out. Briefly speaking, our work presents a new algorithm. We adopt an occupancy-measure representation of the optimal policy and obtain an LP to work with, which is similar to previous work. However, our algorithm resolves an LP and operates in the primal space, which is fundamentally different from previous work that adopts a primal-dual update (e.g., Stooke et al., 2020; Ding et al., 2022; Zeng et al., 2022; Bai et al., 2023; Moskovitz et al., 2023). There is also work developing primal-based algorithms, for example Liu et al., 2019; Chow et al., 2018; 2019; Dalal et al., 2018; Xu et al., 2021. Our algorithm differs from all of these, and we obtain new results: we are able to obtain an instance-dependent sample complexity for the first time in the literature, which improves upon the worst-case sample complexity established in previous work. Although the constrained optimization approach and the Lyapunov approach have also been developed for CMDP problems, they do not enjoy comparable theoretical guarantees. In comparison with the literature, we develop a new primal-based algorithm and achieve the first instance-dependent sample complexity for CMDP problems.
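For concreteness, the occupancy-measure LP for a discounted CMDP takes the following standard form (written in generic notation; the exact formulation and normalization in the paper may differ):
$$
\max_{q \geq 0}\ \sum_{s,a} q(s,a)\, r(s,a) \quad \text{s.t.} \quad \sum_{a} q(s,a) - \gamma \sum_{s',a'} P(s\mid s',a')\, q(s',a') = \mu_1(s)\ \ \forall s, \qquad \sum_{s,a} q(s,a)\, c_k(s,a) \leq \alpha_k\ \ \forall k\in[K],
$$
and any feasible $q$ induces a (randomized) policy via $\pi(a\mid s) \propto q(s,a)$. Our algorithm operates directly on this primal LP.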
We will add a more comprehensive literature review. We would appreciate it if you could point out the papers that we haven't discussed, and we will add discussions of those.
Thank you for the comment! We have conducted basic numerical experiments to illustrate the empirical performance of our algorithms. Please refer to the ``global response'' for more details.
Yeah, sure! Please find in our response to the first weakness the refined discussion of the comparison with existing works. We also provide the following discussion of the dependency on other problem parameters.
``We discuss the dependency of our sample complexity bound on problem parameters other than $\epsilon$. We restrict attention to the MDP context without resource constraints. Denote by $\mathcal{S}$ the state set and by $\mathcal{A}$ the action set. We show a sample complexity bound of order $\mathcal{H}/\epsilon$ (up to logarithmic factors), where $\mathcal{H}$ is a constant that represents the hardness of the underlying problem instance. Compared to the optimal worst-case sample complexity achieved in a series of works (e.g., Sidford et al. 2018, Wainwright 2019, Wang 2020, Agarwal et al. 2020, He et al. 2021), our bound has a worse dependency on $|\mathcal{S}|$ and $|\mathcal{A}|$. This is because our algorithm is LP-based and the dimension of the LP (which scales with $|\mathcal{S}|$ and $|\mathcal{A}|$) influences our final bounds. However, our bound enjoys a better dependency in terms of $\epsilon$. For the general CMDP problem, our bound additionally depends on the condition number of the constraint matrix in the LP formulation, which is a byproduct of the LP-resolving heuristics (e.g., Vera and Banerjee 2021, Li et al. 2021). Nevertheless, our sample complexity bound depends only polynomially on the other parameters, including $|\mathcal{S}|$, $|\mathcal{A}|$, $1/(1-\gamma)$, and the number of constraints.''
Please note that even though our bound has some additional dependencies on other parameters, it can still improve upon the worst-case bound for any problem instance as long as $\epsilon$ is set to be small. To be specific, for an instance $\mathcal{I}$, denote by $C(\mathcal{I})/\epsilon$ our bound (logarithmic terms neglected) and by $1/\epsilon^2$ the worst-case bound. Then, as long as we set $\epsilon \leq 1/C(\mathcal{I})$, our bound will be smaller; for example, if $C(\mathcal{I}) = 10^3$, our bound is smaller whenever $\epsilon \leq 10^{-3}$. Therefore, our bound is favorable when the instance is benign, so that $C(\mathcal{I})$ is small, or when we seek a highly accurate near-optimal solution, so that $\epsilon$ is small.
Thanks for the comment. Due to the space limit, a discussion of the proof ideas had not been added. We will certainly include one; please find a preliminary version below.
``Our algorithm is motivated by the LP-resolving heuristics in the online LP/resource allocation literature (e.g., Vera and Banerjee (2021), Li and Ye (2021)). We can naturally interpret the right-hand-side constraints of the LP as resource capacities, and at each round we track the remaining capacity of each resource. A key step in the proof is then to establish that the average remaining capacities behave as (sub-)martingales. As a result, we can apply concentration arguments to show that the remaining capacities diminish as we reach the end of the horizon, i.e., the resources are well utilized. Moreover, since we have already identified the optimal basis in Algorithm 1 and we resolve the LP sticking to that basis, we can show that when the resources are well utilized, the total reward we collect is very close to the optimal reward. In this way, we obtain a bound relating the total reward collected by our policy to that of the optimal policy, which then translates into the sample complexity bound.''
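To make the resolving mechanism concrete, the following is a toy sketch of the generic LP-resolving heuristic from the online resource-allocation literature; it only illustrates the idea (all problem data below are synthetic placeholders) and is not the paper's Algorithm 2.

```python
# Toy sketch of the generic LP-resolving heuristic for online resource
# allocation: at each round, re-solve an "average" LP with the remaining
# capacity per remaining round, then act according to the LP solution.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
T = 1000                                  # horizon (number of rounds)
m, n = 3, 5                               # number of resources / action types
A = rng.uniform(0.0, 1.0, size=(m, n))    # per-action resource consumption
r = rng.uniform(0.0, 1.0, size=n)         # per-action reward
remaining = 0.4 * T * np.ones(m)          # initial resource capacities

total_reward = 0.0
for t in range(T):
    rounds_left = T - t
    # Average LP: choose a sub-distribution x over actions that maximizes
    # reward while consuming at most the remaining capacity per round.
    A_ub = np.vstack([A, np.ones((1, n))])
    b_ub = np.concatenate([remaining / rounds_left, [1.0]])
    res = linprog(-r, A_ub=A_ub, b_ub=b_ub, bounds=[(0.0, 1.0)] * n)
    x = np.clip(res.x, 0.0, 1.0)
    # Sample an action from x (with probability 1 - sum(x), do nothing).
    p_act = min(x.sum(), 1.0)
    if rng.random() < p_act:
        a_t = rng.choice(n, p=x / x.sum())
        if np.all(remaining >= A[:, a_t]):   # act only if capacity remains
            remaining -= A[:, a_t]
            total_reward += r[a_t]

print(f"collected reward: {total_reward:.1f}")
print("leftover capacity:", np.round(remaining, 2))
```

In the CMDP setting, the "capacities" correspond to the cost budgets $\alpha_k$ and the re-solved LP is restricted to the basis identified by Algorithm 1.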
Thanks for the question! We are happy to add numerical results. We have conducted basic experiments; please refer to the ``global response'' for more details.
Thanks to the authors for the clarification and for answering my questions. I am happy to maintain my current score.
Thank you so much for acknowledging our response!
This paper addresses the reinforcement learning problem for CMDPs. The authors derive a problem-dependent sample complexity bound of $\tilde{O}(1/\epsilon)$, improving upon the state of the art. They introduce a novel way to characterize the hardness of CMDP instances using the LP basis, enabling problem-dependent guarantees. The proposed algorithm involves an elimination procedure to identify an optimal basis and a resolving procedure that adapts to remaining resources, ensuring the policy remains near-optimal with fewer samples.
Strengths
The paper is well written, and the intuition/ideas behind the algorithm and theoretical proofs are clearly explained, making the paper a pleasant read. I've learned something interesting and new.
Weaknesses
While this point may seem minor since the problem setting assumes a tabular formulation with finite and fully observable state and action spaces, it is important to note that the methodology becomes challenging to apply when dealing with large or infinite state spaces where function approximation is required.
Questions
N/A
Limitations
Suggestions:
- A typo in 1.1 Preliminaries line 53: "stochaastic reward" -> stochastic reward
- Define N in line 120
- In RL, the notation $Q^\pi(s,a)$ is usually reserved for the action-value function under policy $\pi$. Thus, it may be a bit disorienting for people from the RL community to see $q$ used as the notation for the occupancy measure. Perhaps use a different notation? I've seen papers using other dedicated symbols for defining occupancy measures.
- On line 244, I believe "...satisfy the condition in Theorem 2.1" should be "...Lemma 2.1".
We would like to thank you for your positive review and insightful comments! Please find below our response to your comments and questions!
Thank you so much for the comment! Indeed, the method developed in this paper is mainly for the tabular setting. However, it can also be extended to handle settings with possibly large state/action spaces. Specifically, we can use linear function approximation and similarly write an LP with the coefficients of the linear approximation as the decision variables. We are currently working on this extension, and combining our method with more general function approximation will be a topic for future work.
Thank you for pointing it out! We will correct it.
Thank you! $N$ refers to the number of episodes.
Thank you for the suggestion! Indeed, a dedicated symbol would be a better notation for the occupancy measure.
You are right! Thanks for the correction!
The strength of this paper is that it provides strong sample complexity results for the constrained MDP, enhancing the existing analysis in the literature by developing a new algorithm. However, despite presenting a promising method, it lacks thorough comparisons with existing methods in the literature.
Strengths
The strength of this paper is that it provides strong sample complexity results for the constrained MDP, enhancing the existing analysis in the literature by developing a new algorithm.
Weaknesses
Although the paper presents a promising method, it does not present more thorough comparisons with existing methods in the literature.
Questions
- Please define the notation $[K]$ on page 2.
- I wonder if the constraint (2) is commonly used in the literature. Please add some discussion of constraint (2) and whether it is used in other papers.
- The term "problem instance hardness" is frequently used in the introduction, but in my opinion it is not familiar to most readers. Therefore, it is necessary to clearly define what the problem instance hardness is.
- Although the authors develop some promising algorithms, the comparison with other approaches still seems weak. Therefore, it would be better if some more thorough discussion of existing works were added.
Limitations
The authors properly addressed the limitations of the paper in the document.
We would like to thank you for your insightful comments! Please find below our response to each of the weakness points and the questions you posed. We hope that our response clarifies your concerns about the paper.
Thank you for the comment! We will certainly provide a better literature review and comparison with previous methods. Briefly speaking, our work presents a new algorithm. We adopt an occupancy-measure representation of the optimal policy and obtain an LP to work with, which is similar to previous work. However, our algorithm resolves an LP and operates in the primal space, which is fundamentally different from previous work that adopts a primal-dual update (e.g., Stooke et al., 2020; Ding et al., 2022; Zeng et al., 2022; Bai et al., 2023; Moskovitz et al., 2023). There is also work developing primal-based algorithms, for example Liu et al., 2019; Chow et al., 2018; 2019; Dalal et al., 2018; Xu et al., 2021. Our algorithm differs from all of these, and we obtain new results: we are able to obtain an instance-dependent sample complexity for the first time in the literature, which improves upon the worst-case sample complexity established in previous work. Although the constrained optimization approach and the Lyapunov approach have also been developed for CMDP problems, they do not enjoy comparable theoretical guarantees. In comparison with the literature, we develop a new primal-based algorithm and achieve the first instance-dependent sample complexity for CMDP problems.
We will add a more comprehensive literature review. We would appreciate it if you could point out the papers that we haven't discussed, and we will add discussions of those.
Thanks. The definition is $[K] = \{1, 2, \ldots, K\}$.
Thanks for the comment! Indeed, there are other formulations of the constraints in CMDPs, for example,
$$
V_k(\pi, \mu_1)=\mathbb{E}\left[ \sum_{t=0}^{\infty}\gamma^t\cdot c_k(s_t, a_t)\mid \mu_1 \right] \geq \lambda_k, ~~\forall k\in[K],
$$
in a series of works that study safe reinforcement learning. However, the above formulation can be transformed into our formulation in constraint (2). One can set $\alpha_k = \frac{1}{1-\gamma} - \lambda_k$ for each $k\in[K]$, and it is easy to see that the two inequalities are equivalent to each other,
$$
\mathbb{E}\left[ \sum_{t=0}^{\infty}\gamma^t\cdot c_k(s_t, a_t)\mid \mu_1 \right] \geq \lambda_k \Leftrightarrow \mathbb{E}\left[ \sum_{t=0}^{\infty}\gamma^t\cdot (1-c_k(s_t, a_t))\mid \mu_1 \right] \leq \alpha_k.
$$
Therefore, we can equivalently use the formulation in constraint (2) with the cost function defined as $1-c_k(s_t, a_t)$ for each $k\in[K]$.
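The equivalence above follows from the geometric-series identity $\sum_{t\geq 0}\gamma^t = \frac{1}{1-\gamma}$:
$$
\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t\cdot\big(1-c_k(s_t,a_t)\big)\,\middle|\,\mu_1\right] \;=\; \frac{1}{1-\gamma} - \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t\cdot c_k(s_t,a_t)\,\middle|\,\mu_1\right],
$$
so $V_k(\pi,\mu_1)\geq\lambda_k$ holds if and only if the left-hand side is at most $\alpha_k = \frac{1}{1-\gamma}-\lambda_k$.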
Thank you! Please note that when deriving instance-dependent learning guarantees, it is important to define a measure that describes how difficult it is to separate the optimal policies from the sub-optimal ones. The importance of such a measure has been illustrated in other problems, such as multi-armed bandit problems (e.g., Lai and Robbins (1985)) and reinforcement learning problems (e.g., Auer et al. (2008)). There are also works studying how to characterize such a measure for instance-dependent learning in general sequential decision-making problems, for example Wagenmaker and Foster (2023). This measure is usually defined as the gap (a positive constant) between the value of the optimal policies and the value of the best sub-optimal policies. However, since the optimal policies for CMDP problems are randomized policies, the sub-optimal policies can be arbitrarily close to the optimal ones. In our paper, we show that if we restrict attention to the policies represented by the corner points of the LP, then such a gap can be characterized as the difference between the optimal corner points and the sub-optimal corner points. If this gap is $\Delta$, then on the order of $1/\Delta^2$ samples are required to identify the optimal corner point. In summary, ``problem instance hardness'' is a measure of the number of samples needed to separate the optimal policies from the sub-optimal ones, and our corner-point characterization motivates our entire approach.
Thank you! We will provide a better comparison with existing methods; please refer to our response to the weakness part above.
This paper studies the reinforcement learning problem under Constrained Markov Decision Processes (CMDPs). It formulates the problem using linear programming and designs a novel algorithm to solve it. Using the newly designed algorithm, the authors prove a sample complexity of $\tilde{O}(1/\epsilon)$, albeit at the expense of some additional instance-dependent factors.
Strengths
- The sample complexity analyzed in this paper breaks the $\Omega(1/\epsilon^2)$ barrier, which is known as the lower bound for the problem that the paper studies.
- The algorithm proposed in this paper is novel. Under the linear programming framework, it designs an algorithm that focuses only on the LP bases (corner points of the feasible region). The algorithm proceeds in iterations, and in each iteration the order of the number of samples collected is independent of $\epsilon$. Unlike some traditional methods that operate in the dual space or use primal-dual techniques, this algorithm operates directly in the primal space.
Weaknesses
- The bound has several additional instance-dependent factors compared to other complexity bounds.
- Since the algorithm focuses on the corner points, it defines a separation gap between the optimal corner point and the sub-optimal corner points, which is needed to ensure that the estimation errors are small enough to distinguish the optimal policy from sub-optimal policies. However, since the algorithm outputs stochastic policies, it is possible that sub-optimal policies are very close to the optimal policy. Similarly, the paper defines a minimum gap in the dual values when some constraints are excluded; if excluding a constraint does not change the value of the dual problem much, this gap will also be small. Therefore, the additional dependence on the inverse of these gaps might not be worth trading for the improved dependence on $\epsilon$.
- In addition, from their definitions, these gap quantities are likely to be small terms, for example, due to rarely visited states and small eigenvalues of the relevant matrices.
- Due to the above concerns, it would be better to conduct experiments showing that the number of samples needed under this framework is indeed smaller than that needed by other algorithms achieving the $O(1/\epsilon^2)$ rate. However, there are no numerical experiments in this work, so it is hard to tell whether the newly proposed algorithm is more efficient in practice.
Questions
Could you theoretically provide some cases illustrating when those additional dependencies do not make the bound worse? It would also be better if the cases provided are not edge cases.
Limitations
There are no potential negative societal impact concerns for this theoretical work.
We would like to thank you for your positive review and insightful comments! Please find below our responses to the weaknesses and your question. We hope that our response clarifies your concerns regarding our work.
Thank you for the comment! You are right that our bound has additional dependencies on the model parameters; however, this seems to be common for instance-dependent learning and has shown up in other works. It is natural that when we obtain a better dependency on $\epsilon$, we suffer a worse dependency on other parameters. However, it is important to note that the additional dependency on the model parameters is fixed and independent of $\epsilon$. Therefore, when we seek a highly accurate near-optimal policy and set $\epsilon$ to be small, our bound becomes the better one. Please refer to our response to your question below for a more detailed explanation.
Thanks, and you are right that for some instances the separation gap can be small, making it difficult to separate the optimal policy from the sub-optimal ones and leading to a large bound. However, please note that no matter how small the gap is, it is independent of $\epsilon$. Therefore, even if the gap is very small, as long as we seek a solution with high accuracy, i.e., the error $\epsilon$ is also very small, it can still be desirable to trade the worst-case dependence on $\epsilon$ for the dependence on the gap.
Thanks for the comments! You are right that for some bad instances the dependencies on other problem parameters can be large. However, these parameters are always independent of $\epsilon$. Therefore, when we want to find a near-optimal policy with a small $\epsilon$, the bound is desirable even if the dependencies on other parameters are large. Please refer to our response to your question below for a more detailed explanation.
You are right that it would be better to conduct some numerical experiments to support our results. Due to the time limit, we have conducted basic experiments; please refer to the ``global response'' for more details on our numerical experiments.
Thank you so much for the question, which allows us to provide further clarifications! Please note that although our bound has some additional dependencies on the problem parameters, these are independent of the accuracy level $\epsilon$. To be specific, for a problem instance $\mathcal{I}$, denote by $C(\mathcal{I})$ the constant term in our bound. Then our sample complexity bound on the instance $\mathcal{I}$ is $\tilde{O}(C(\mathcal{I})/\epsilon)$, whereas the worst-case sample complexity bound established in the previous literature is $\tilde{O}(1/\epsilon^2)$. Please note that $\epsilon$ is the accuracy level that we can decide. Therefore, for any problem instance $\mathcal{I}$, as long as we set $\epsilon$ small enough so that $\epsilon \leq 1/C(\mathcal{I})$, our instance-dependent bound will be better than the worst-case bound. That is, for any problem instance, even if the instance is not favorable and the constant term $C(\mathcal{I})$ in our bound is large, our bound can always be better than the worst case as long as we set $\epsilon$ small enough, i.e., we seek a policy with low error.
Thank you for the clarification and my questions are mostly addressed. I will keep my score the same.
Thank you for acknowledging our response!
We implemented our algorithm to study its numerical performance. We consider a CMDP problem with a finite state space and action space and a fixed discount factor $\gamma$. We then randomly generate the probability transition kernel: for each state $s$, action $a$, and future state $s'$, we draw a uniform random variable, and we normalize these draws over $s'$ to obtain the transition probabilities. For each state-action pair, the expected reward is drawn uniformly from a bounded interval (with the reward for the first action fixed to a constant value), and the realized reward is the expected reward perturbed by bounded uniform noise. There are multiple constraints; for each constraint and each state-action pair, the expected cost is drawn uniformly from a bounded interval, and the realized cost is the expected cost perturbed by bounded uniform noise.
For each total number of iterations $N$, we apply our algorithm and obtain an output occupancy measure. We compare it with the optimal occupancy measure and define the error as the distance between the two, and we study how this error scales with $N$. The results are displayed in the attached PDF. As we can see, the error drops quickly within a moderate number of iterations and keeps improving as we run more iterations. The computation time of our algorithm is also small, since in each iteration we only need to solve a set of linear equations. This evidence demonstrates the numerical efficiency of our algorithm, and we will conduct more extensive experiments in future work.
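For reference, below is a minimal sketch of such a random CMDP generator together with the occupancy-measure LP that yields the benchmark optimal occupancy measure $q^*$; the sizes, intervals, budgets, and discount factor are illustrative placeholders and may differ from the exact values used in our experiments.

```python
# Minimal sketch: generate a random discounted CMDP and solve the
# occupancy-measure LP to obtain the benchmark optimal occupancy measure.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
S, A, K = 5, 3, 2          # numbers of states, actions, constraints (placeholders)
gamma = 0.9                # discount factor (placeholder)

# Transition kernel: draw uniform variables and normalize over next states.
u = rng.uniform(size=(S, A, S))
P = u / u.sum(axis=2, keepdims=True)       # P[s, a, s'] = Pr(s' | s, a)

r = rng.uniform(size=(S, A))               # expected rewards in [0, 1]
c = rng.uniform(size=(K, S, A))            # expected costs in [0, 1]
alpha = 0.75 / (1.0 - gamma) * np.ones(K)  # cost budgets (placeholder)
mu1 = np.ones(S) / S                       # initial-state distribution

# Occupancy-measure LP over q(s, a) >= 0:
#   max  sum_{s,a} q(s,a) r(s,a)
#   s.t. sum_a q(s,a) - gamma * sum_{s',a'} P(s | s',a') q(s',a') = mu1(s),
#        sum_{s,a} q(s,a) c_k(s,a) <= alpha_k  for all k.
n = S * A                                  # variable index: q(s, a) -> s * A + a
A_eq = np.zeros((S, n))
for s in range(S):
    for sp in range(S):
        for a in range(A):
            A_eq[s, sp * A + a] = float(s == sp) - gamma * P[sp, a, s]
res = linprog(-r.reshape(n), A_ub=c.reshape(K, n), b_ub=alpha,
              A_eq=A_eq, b_eq=mu1, bounds=[(0.0, None)] * n)
assert res.success, res.message
q_star = res.x.reshape(S, A)               # benchmark optimal occupancy measure
print("optimal constrained value:", round(float(r.reshape(n) @ res.x), 4))
```

The error reported above is then the distance between the output of our algorithm and $q^*$ computed in this way.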
This paper derives optimal problem-dependent guarantees for CMDPs. In particular, the paper proposes a primal only algorithm and prove that the resulting algorithm has an instance-dependent sample complexity bound.
The reviewers agree that the paper is well-written, and its contributions merit acceptance. After carefully reading the paper and the corresponding discussion, I tend to agree. Please incorporate the reviewers' feedback. In particular, addressing the following concerns will help strengthen the paper:
- Include the experiment done as part of the author response. An empirical comparison with an existing approach (e.g. primal-dual policy gradient) will be helpful.
- Explain whether the proposed algorithm (without any modification) retains the problem-independent (worst-case) sample-complexity guarantee
- Better situate the resulting algorithm and compare to the existing literature (response to Rev. w3CH)
- Add the discussion about the minimax lower-bound (response to Rev. hDFd)
- Compare to the previous bounds in terms of the dependence on $|S|$ and $|A|$ (and not just $\epsilon$)
- Add a proof sketch. Including some more details about the online LP literature and better connecting it to the CMDP problem will enhance the paper's readability for the audience working on constrained RL.
- Clearly explain the dependence on the constants in Theorem 5.2. What is tight and what can be relaxed? Conjecturing a lower bound and having a discussion section will be helpful.