PaperHub
Overall rating: 6.0/10 (Poster; 4 reviewers; min 5, max 7, std 0.7)
Individual ratings: 5, 6, 7, 6
Confidence: 3.3 · Soundness: 2.5 · Contribution: 2.5 · Presentation: 2.8
NeurIPS 2024

Contextual Bilevel Reinforcement Learning for Incentive Alignment

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06

Abstract

Keywords
Reinforcement Learning · Bilevel Optimization · Contextual MDPs · Environment Design · Model Design

Reviews and Discussion

Official Review
Rating: 5

This paper proposes a general approach for bilevel optimization where the lower-level component is modeled as a Markov Decision Process (MDP). Traditional solvers for bilevel optimization often involve the computation of the Hessian term, which can be complex and computationally intensive. The main strategy employed here leverages the properties of entropy-based reinforcement learning (RL), which offers a closed-form solution. Consequently, the lower-level problem is simplified to a best response function, streamlining the training process. The authors also explore additional scenarios, including one where the upper-level objective is a discounted reward and another where the leader can further influence the follower.

Experiments conducted in a four-room environment demonstrate the effectiveness of the proposed method.

Strengths

The problem setting of bilevel optimization in reinforcement learning (RL) is crucial for the community.

The authors provide a detailed theoretical analysis of their proposed methods.

Weaknesses

  1. The methods proposed might only be used when the lower-level components have a closed form, as the algorithm relies on the closed form provided by entropy-based reinforcement learning (RL). What happens in cases where entropy RL is not used in the lower-level MDP, such as in meta RL?

  2. The authors propose various scenarios, including meta RL and macroeconomic modeling, but only test the method in the Four-Rooms environment. Could the authors also conduct experiments on meta RL or economic models, and compare relevant baselines?

  3. Proposition 9 follows the established work closely, which limits its theoretical contributions.

  4. The authors propose having full access to the lower-level components, but this does not seem to be utilized in the experiments. It would be beneficial to include experiments that make use of this access.

Questions

Please answer my questions mentioned above.

Limitations

Please refer to the weaknesses section.

Author Response

We thank the reviewer for their helpful comments.

Q1: Entropy. Entropy regularization allows us to compute the hypergradient explicitly without resorting to implicit differentiation. Without the entropy term, the lower level admits multiple solutions. The problem thus becomes ill-posed, as one needs to decide which optimal policy to use in the upper-level objective, and our method is not directly applicable.

However, with regularization, our problem is still harder than classical stochastic bilevel optimization with a strongly convex lower level. Moreover, important RL applications use regularization. For example, recent work has framed RL with human feedback (RLHF) as a BO-CMDP with lower-level entropy regularization [Chakraborty et al., 2024]. We will include this example in the next version. Additionally, prior work on bilevel RL also considered lower-level regularization. In fact, [12, 39] have shown that entropy-regularized RL approximates the unregularized problem as $\lambda \rightarrow 0$ by bounding the gap between the regularized and unregularized problems in both the upper and lower-level objectives.

Concerning meta-RL: in Sec. 2.2 we present a formulation with KL divergence instead of entropy regularization. The optimal policy in this case is also a softmax of the form $\pi^*_{x,\xi}(s;a) \propto \exp\big(Q^*_{\lambda,x,\xi}(s,a)/\lambda + \log(\tilde{\pi}(s,a))\big)$, with $\tilde{\pi}$ being the target policy [Vieillard et al., 2020]. Therefore, our analysis extends to this setting. We will clarify this in the next version.
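To make the closed-form structure concrete, here is a minimal tabular sketch of soft value iteration, which yields exactly this softmax policy. This is not the paper's implementation; the function name and interfaces are ours. With a uniform reference policy it recovers entropy regularization up to a constant, and with a non-uniform $\tilde{\pi}$ it gives the KL-regularized policy above.

```python
import numpy as np

def soft_value_iteration(P, r, gamma, lam, prior=None, iters=500):
    """Illustrative tabular soft value iteration (not the paper's code).

    P: transitions of shape (S, A, S); r: rewards of shape (S, A);
    gamma: discount factor; lam: regularization temperature;
    prior: reference policy of shape (S, A) for KL regularization
    (a uniform prior recovers entropy regularization up to a constant).
    """
    S, A = r.shape
    if prior is None:
        prior = np.full((S, A), 1.0 / A)
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * P @ V                     # soft Bellman backup, shape (S, A)
        # regularized value: V(s) = lam * log sum_a prior(s,a) exp(Q(s,a)/lam)
        m = (Q / lam).max(axis=1, keepdims=True)  # shift for numerical stability
        V = lam * (np.log(np.sum(prior * np.exp(Q / lam - m), axis=1)) + m[:, 0])
    # closed-form optimal policy: pi(a|s) proportional to exp(Q(s,a)/lam + log prior(s,a))
    logits = np.log(prior) + Q / lam
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    return Q, V, pi
```

In the BO-CMDP setting, $r$ (and possibly $P$) would additionally depend on the upper-level variable $x$ and the context $\xi$; it is this explicit closed form that lets the hypergradient be computed without implicit differentiation.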

Q2: Experiments. Thank you for the nice suggestions. The current experiment already covers the important economic application of Principal-Agent reward shaping, as motivated in Sec. 2. We will clarify this connection in the final version. We would like to emphasize that the primary contribution of our paper is the novel problem setting, algorithmic design, and the convergence of HPGD, providing the first upper bound on the complexity of stochastic bilevel RL. For a comparison to prior works and their complexities, see Table 1 in the attached PDF. It illustrates that our setting is more general and the randomized algorithm more practical.

Including more experiments would definitely be interesting. While Meta-RL, RLHF, and Dynamic Mechanism Design are significant applications of BO-CMDP, these are very active research areas and are worth a more thorough investigation, which we plan to address in future work.

Q3: Prop. 9. At a high level, the proof of Prop. 9 works similarly to [30], as both papers use multi-level Monte Carlo (MLMC). However, at a more detailed level, there are several unique challenges in BO-CMDP not covered in [30].

The first challenge (cf. Sec. 3.4) is that in [30] the samples generated to estimate the hypergradient are independent of the lower-level decision variable. This is crucial to control the variance. When calculating their respective estimates $\frac{d}{dx}F_{t_{k+1}}$ and $\frac{d}{dx}F_{t_k}$, they can sample once and use the same data for both hypergradient estimates, such that the variance of $\frac{d}{dx}F_{t_1} + p_k^{-1}\left[\frac{d}{dx}F_{t_{k+1}} - \frac{d}{dx}F_{t_k}\right]$ is easy to bound. In our case (cf. Alg. 1), the hypergradient estimate depends on an action $a$ and a trajectory following $a$, both generated by the lower-level policy. Thus, the data used to estimate the hypergradient depends on the lower-level decision, and one cannot use one trajectory to estimate both $\frac{d}{dx}F_{t_k}$ and $\frac{d}{dx}F_{t_{k+1}}$. On the other hand, sampling different trajectories according to $\pi^{t_k}$ and $\pi^{t_{k+1}}$ would significantly increase the variance.

The second challenge is that the estimator of the advantage derivative $\widehat{\partial_x A^{\pi^{t_k}}}(s,a)$, computed in Alg. 2, itself has high variance. This issue does not arise in [30], as they do not consider RL in the lower level.

To address the first challenge, we can use importance sampling, i.e., sampling the action $a$ from $\pi^{t_{k+1}}$ for both estimators with an additional factor $\frac{\pi^{t_k}(a;s)}{\pi^{t_{k+1}}(a;s)}$ for the second one.

To address the second challenge, we can compute $\widehat{\partial_x A^{\pi^{t_k}}}(s,a)$ by averaging over $2^k$ trajectories to control the noise.

By addressing these two challenges as described (see the sketch below), we obtain the desired variance bound of $\mathcal{O}(K)$, which further strengthens Proposition 9. In the final version, we will clarify the challenges and the technical contributions we make in overcoming them.
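For illustration, the sketch below shows how the two fixes could be combined in an MLMC-style estimator. It is schematic only: the helper functions (`sample_state`, `sample_action`, `hypergrad_term`) and their interfaces are placeholders of ours, and the paper's Algorithm 1 differs in its details.

```python
import numpy as np

def mlmc_hypergradient(policies, sample_state, sample_action, hypergrad_term,
                       K, rng=None):
    """Schematic MLMC-style hypergradient estimate (not the paper's Alg. 1).

    policies[k]: lower-level policy after t_k inner iterations, callable as
      policies[k](a, s) -> probability of action a in state s;
    sample_action(pi, s): draws an action from pi(.|s);
    hypergrad_term(pi, s, a, n_traj): hypergradient contribution whose
      advantage-derivative estimate is averaged over n_traj trajectories.
    Assumes len(policies) >= K + 1.
    """
    rng = np.random.default_rng() if rng is None else rng

    # geometric level distribution p_k proportional to 2^{-k}
    p = 2.0 ** -np.arange(1, K + 1)
    p /= p.sum()
    k = rng.choice(K, p=p)

    s = sample_state()

    # base term at the coarsest level t_1
    a1 = sample_action(policies[0], s)
    g = hypergrad_term(policies[0], s, a1, n_traj=1)

    # correction term: one action drawn from the finer policy pi^{t_{k+1}} and
    # reused for both levels via an importance weight (fix for challenge 1),
    # with 2^k trajectories to tame the advantage-derivative noise (fix 2)
    a = sample_action(policies[k + 1], s)
    fine = hypergrad_term(policies[k + 1], s, a, n_traj=2 ** k)
    w = policies[k](a, s) / policies[k + 1](a, s)
    coarse = w * hypergrad_term(policies[k], s, a, n_traj=2 ** k)
    return g + (fine - coarse) / p[k]
```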

To conclude, while the high-level proof idea of Prop. 9 follows [30], extending their results to the class of BO-CMDP is important and non-trivial, as the latter poses unique challenges not addressed in previous works.

Q4: Lower-level Access. To clarify, BO-CMDP is a broad class of problems where the upper level can influence some (but not necessarily all) of the lower-level MDP's rewards, transitions, and initial distribution. Contrary to previous works, we do not assume full access to or control over how the lower-level problem is solved. The only exception is Sec. 3.4, where we discuss how to leverage full access to accelerate HPGD.

Our experiments study reward shaping, which is an important instance of BO-CMDP. While conducting experiments in other settings is interesting, we leave them for future investigation (see our answer to Q2). We hope this answers the question. Otherwise, please clarify what is meant by "access to lower-level components".

Final Remarks. We hope our clarifications and proposed changes address the reviewer's questions, in which case we would greatly appreciate it if you could consider raising the score appropriately.

New References

Chakraborty et al., 2024. PARL: A unified framework for policy alignment in reinforcement learning from human feedback.

Vieillard et al., 2020. Leverage the average: An analysis of KL regularization in reinforcement learning.

Comment

Thank you for your detailed response. Some of my concerns have been adequately addressed, so I have decided to increase my score from 4 to 5.

However, I still have some concerns regarding Q1 and Q2:

  1. Could you elaborate on the following sentence: "However, with regularization, our problem is still harder than classical stochastic bilevel optimization with a strongly convex lower level." My understanding is that if we have a best response function, the bilevel optimization problem would degenerate into a (single) minimization problem, which would make the corresponding theoretical analysis easier than that of bilevel optimization with a strongly convex lower level or bilevel optimization with a non-convex lower level [1].

  2. I still believe the experiments are somewhat simplistic, as also pointed out by Reviewer Y6ix and Reviewer UZX1. While I understand that the authors may be focusing more on problem settings and theoretical analysis, I think it is important to test the proposed method in various environments to validate its effectiveness.

[1] Liu, Risheng, et al. "A value-function-based interior-point method for non-convex bi-level optimization." International Conference on Machine Learning. PMLR, 2021.

Comment

Thank you very much for taking the time to reply and for raising your score. We appreciate your feedback and will take it into consideration.

We would like to further clarify your question about the best-response policy in bilevel RL. Indeed, entropy regularization gives the lower level a closed-form solution, such that bilevel RL reduces to a single-level problem. However, the closed-form solution of the lower-level problem depends on the optimal Q-function, which is unknown to the decision maker. As a result, one still needs to solve the lower-level contextual MDP to compute the hypergradient, which is generally more complicated than solving a static strongly convex optimization problem. From an algorithmic perspective, the additional structure introduced by entropy regularization thus does not make the problem easier to solve.
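For concreteness (with notation that only loosely follows the paper), the reduction and the remaining difficulty can be sketched as

$$
\min_{x}\; F(x) \;=\; \mathbb{E}_{\xi}\Big[f\big(x,\pi^{*}_{x,\xi}\big)\Big],
\qquad
\pi^{*}_{x,\xi}(a\mid s)\;\propto\;\exp\!\big(Q^{*}_{\lambda,x,\xi}(s,a)/\lambda\big),
$$

so the problem is indeed single-level in $x$. However, both evaluating $F$ and forming the hypergradient $\frac{d}{dx}F(x)$ require $Q^{*}_{\lambda,x,\xi}$, which is only available by (approximately) solving the lower-level contextual MDP for every sampled context $\xi$ and every new $x$.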

As for Liu et al. [1], thank you for pointing out the interesting reference. We will definitely include it in the next version. Bilevel optimization with nonconvex lower-level problems is generally very hard to solve. For instance, [1] only provides asymptotic convergence results without iteration or sample complexities. To our knowledge, entropy-regularized bilevel RL (in our case even with additional lower-level contextual MDPs) is one of the very few bilevel problems with a nonconvex lower level that comes with non-asymptotic convergence guarantees.

As you pointed out, our contribution focuses on introducing the bilevel contextual RL model, the algorithm design, and the theoretical convergence analysis. The summary table in the one-page additional PDF illustrates the broad applicability of our model and the strength of the theoretical results. Supporting the theory, our experiments illustrate how HPGD works well in practice for one of the central applications and motivations of our work: reward shaping. Conducting experiments in various settings and in specific applications such as RLHF is very interesting in its own right, and we aim to study it thoroughly in future work.

We hope this addresses your questions and helps your assessment of our work.

Official Review
Rating: 6

This paper introduces the framework of bilevel optimization with lower-level contextual MDPs (BO-CMDP), which covers a wide range of practical decision-making applications. The authors of this paper propose the Hyper Policy Gradient Descent (HPGD) algorithm for solving BO-CMDP. They prove the convergence of HPGD theoretically and demonstrate the practical efficiency of HPGD in experiments.

Strengths

This paper introduces an interesting framework of BO-CMDP and shows that many practical applications can be formulated as BO-CMDP problems. The authors propose HPGD and prove its convergence. The authors also verify the practical efficiency of HPGD by experiments.

Weaknesses

There are several typos in the statements and the proofs of the main results of this paper.

Questions

  • In line 551, the authors said they used the results from [30]. It seems like they used Lemma 1 in [30], which requires stepsize $\alpha \leq 1/(2S_f)$, but the authors did not mention this requirement on the stepsize $\alpha$ in either the main text or the appendix of this paper. Moreover, the coefficient before $S_f$ in [30] is 1, whereas in this paper, the coefficient is 2. Could the authors explain the reasons for this discrepancy?

  • Assumption 3.1 requires further clarification. In line 551, the authors used the infinity norm, but in Assumption 3.1, the specific norms used for $\nabla f$ were not specified.

  • Typos:

In line 551, should $L$ be $L_f$?

In line 188, the $(s,a)$ on the left side of the equation should have subscripts in $t$.

In line 599 and subsequent lines, $d\log P_x(s';s,a)$ is missing $/dx$.

Limitations

N/A

Author Response

We thank the reviewer for taking the time to evaluate our work and appreciate their comments.

For Assumption 3.1, we will clarify that the statement is with respect to the infinity norm. We will also add the assumption on the stepsize from [30] and fix the typo in line 551, where the extra 2 is not needed, as well as all other typos pointed out. Note that the results remain unchanged. We greatly appreciate your help in presenting the paper in its best possible form. In the attached one-page PDF, we add a summary table comparing our work with prior works that only consider deterministic updates without context information.

If you have additional questions, please reach out to us during the discussion period and we will be happy to clarify. If you have no further concerns, we would appreciate it if you could consider raising the score as you see appropriate.

Comment

I thank the authors for the responses, which help reinforce my positive view of this work.

Comment

Thanks for your time and acknowledgement!

Official Review
Rating: 7

The paper introduces BO-CMDPs, a form of two-level optimization problem in which the inner optimization is a contextual MDP and the outer optimization problem selects the injected context based on some objective. The authors connect this formulation to a range of interesting domains (meta-RL, economics & reward shaping) before proposing a gradient-based algorithm for solving these problems. Depending on the level of insight the outer optimiser has into the workings of the inner one, different convergence guarantees are established, before the algorithm is assessed on a toy example. An ablation into the role of several hyperparameters is also provided.

Strengths

Overall the work is well-presented and easy to follow.

The introduced problem setting is well-motivated and connected to existing work. The example applications are of high interest to the community and lend further importance to the proposed formulation.

Technical contributions and theoretical analysis appear sound, and the empirical examination supports the authors' claims.

Weaknesses

Arguably, the main limitation of the paper is the simple nature of the toy example.

While the provided results appear promising and the examinations of the learned incentive maps illustrate the viability of the algorithm, they leave questions regarding the computational complexity and scalability of the proposed method. I think such questions could be answered by returning to one of the motivating examples (e.g. Meta-RL or Reward Shaping tasks) and applying HPGD to some moderate-to-large-scale settings. This would also lend further credibility to the paper's motivating settings.

Questions

A minor point: Would it make sense to rename the problem setting to Bi-cMDP? "BO" is quite overloaded with Bayesian Optimisation and might be slightly confusing for some readers.

Limitations

The authors' assumptions are stated where necessary. The algorithm's limitations and room for future work are briefly outlined in the final section of the paper.

Author Response

We thank the reviewer for their thoughtful evaluation and valuable comments.

Experiments. Thank you for the nice suggestions. The current experiment already covers an important economic application of Principal-Agent reward shaping, as initially motivated. We will clarify this connection in the final version of the paper. In addition, we would like to emphasize that the primary contribution of our paper is the novel problem setting, algorithmic design, and the convergence analysis of HPGD, providing the first upper bound on the complexity of bilevel RL in a stochastic setting. For a comparison to prior works and their complexities, we add a summary table in the attached PDF. It illustrates that our setting is more general and our randomized algorithm more practical.

Including more experiments would definitely be interesting. While Meta-RL, Reinforcement Learning with Human Feedback, and Dynamic Mechanism Design are significant applications of BO-CMDP, these are very active research areas and are worth more thorough investigation, which we plan to address in future work.

Naming. We really appreciate your suggestion to rename the problem setting. We agree that both "BO" and "CMDP" are overloaded terms, commonly referring to "Bayesian Optimisation" and "Constrained MDP", respectively. We will consider renaming the problem setting.

We sincerely hope that our responses have answered your questions and concerns, in which case we would appreciate it if you could raise your scores appropriately. If you have additional questions, please reach out to us during the discussion period and we will be happy to clarify.

Comment

I thank the authors for addressing the questions raised by the other reviewers and myself. I have adjusted my score to match the proposed improvements to the paper.

Comment

We are glad that you appreciate our responses. Thank you for your time and acknowledgement!

Official Review
Rating: 6

This paper considers the bilevel reinforcement learning problem with lower-level contextual MDPs. Compared to existing works, the problem considered is more general. The paper proposes a hyper-gradient based method. The hyper-gradient can be directly evaluated by leveraging the closed-form optimal policy in terms of the optimal value function, enabled by the entropy regularization in the lower level. Furthermore, the paper establishes non-asymptotic convergence for the proposed method.

Strengths

The strengths of this work can be summarized as follows:

  1. The paper is well-written and easy to follow.
  2. Compared to previous works where the lower-level problem does not include context, the lower-level contextual MDP considered in this work is more general and includes more applications.
  3. The closed-form hyper-gradient facilitates practical implementation.

Weaknesses

Though the paper offers experiments showing the effectiveness of their method, the experiments are somewhat toy-like. I believe the work could benefit from including more scaled-up experiments, such as the meta-RL application listed in the application section.

Questions

How does the complexity of the proposed algorithm compare to the existing methods? A table of result comparison might be beneficial for putting this work in context.

Limitations

No potential social impact.

Author Response

We thank the reviewer for taking the time to evaluate our work and appreciate their comments.

Experiments. Thank you for the nice suggestions. The current experiment already covers an important economic application of Principal-Agent reward shaping, as initially motivated. We will clarify this connection in the final version of the paper. In addition, we would like to emphasize that the primary contribution of our paper is the novel problem setting, algorithmic design, and the convergence analysis of HPGD, providing the first upper bound on the complexity of bilevel RL in a stochastic setting.

Including more experiments would definitely be interesting. While Meta-RL, Reinforcement Learning with Human Feedback, and Dynamic Mechanism Design are significant applications of BO-CMDP, these are very active research areas and are worth more thorough investigation, which we plan to address in future work.

Complexity comparison. Following your suggestion, we provide a comparison table with prior works to highlight our contribution. It can be found in the attached PDF and will be included in the final version. The related works we compare to are listed below. Note that two of them were published only recently and were thus not yet included in our submission.

  • [Chakraborty et al., 2024] Chakraborty, S., Bedi, A., Koppel, A., Wang, H., Manocha, D., Wang, M., and Huang, F. (2024). PARL: A unified framework for policy alignment in reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations.

  • [Shen et al., 2024] Shen, H., Yang, Z., and Chen, T. (2024). Principled penalty-based methods for bilevel reinforcement learning and RLHF. arXiv preprint arXiv:2402.06886.

  • [Chen et al., 2022] Chen, S., Yang, D., Li, J., Wang, S., Yang, Z., and Wang, Z. (2022). Adaptive model design for Markov decision process. In International Conference on Machine Learning, pages 3679–3700. PMLR.

The table emphasizes that in our work we consider stochasticity in the hypergradient estimates, at the upper level (via the context), and at the lower level when using Q-learning. None of these three factors has been considered in previous work. Note that these three points are essential to design a scalable and practically relevant algorithm. Furthermore, we emphasize that if we do not have a stochastic hypergradient estimate, i.e., the variance of our estimator is zero, the last term, which scales with $\alpha$, vanishes in our Theorem 3. This improves the convergence rate of HPGD to $\mathcal{O}(1/T)$, which translates to $\mathcal{O}(\varepsilon^{-2})$ complexity in the upper level in Table 1, and our analysis thus exactly recovers the deterministic rates of previous works. Theorem 3 therefore both subsumes and significantly extends the previous results in the literature.
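To illustrate the mechanism (this is the generic shape of a nonconvex SGD-type bound, not a verbatim restatement of Theorem 3): with step size $\alpha$ and hypergradient variance $\sigma^2$, one typically has

$$
\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big\|\nabla F(x_t)\big\|^{2}
\;\lesssim\;
\frac{F(x_1)-\inf_x F(x)}{\alpha T} \;+\; L\,\alpha\,\sigma^{2}.
$$

When $\sigma^{2}=0$, the second term disappears, a constant step size yields an $\mathcal{O}(1/T)$ rate, and reaching an $\varepsilon$-stationary point takes $\mathcal{O}(\varepsilon^{-2})$ upper-level iterations; with $\sigma^{2}>0$, balancing the two terms via $\alpha \propto 1/\sqrt{T}$ gives the slower stochastic rate.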

We hope that our additional comparison table provides valuable insights for the reviewer, in which case we would appreciate it if you could consider raising the scores appropriately.

Author Response

We thank all reviewers for their efforts and helpful comments. We appreciate that all reviewers recognized our contributions and efforts. Specifically, they highlighted:

  • Our "well-motivated" (UZX1) problem formulation (BO-CMDP) and “interesting framework” (miD3) that is "crucial for the community" (Gnk4) with "example applications [...] of high interest to the community and [that] lend further importance to the proposed formulation" (UZX1)
  • The “detailed technical analysis” (Gnk4) that establishes non-asymptotic convergence guarantees.
  • The style and writing of the work. In particular, we were happy to hear that reviewers found the paper "well-presented" (UZX1), "well-written", and "easy to follow" (Y6ix).

Reviewer Y6ix asked us to include a table summarizing how our work compares to other results on bilevel RL, which we were more than happy to do. The resulting comparison can be found in the PDF. The table emphasizes that our work is the first to consider contextual lower-level MDPs, stochasticity in the hypergradient estimates, and randomized algorithms to solve the lower-level problem (when using Q-learning). None of these three factors was ever considered in previous work. We note that when the hypergradient estimator becomes deterministic, we recover exactly the same upper-level convergence rates as previous works. Moreover, when using stochastic updates, our randomly-truncated Q-learning (RT-Q) algorithm significantly reduces the number of lower-level iterations needed.

Final Decision

Summary: The paper addresses bilevel reinforcement learning with lower-level contextual MDPs, introducing a gradient-based method called Hyper Policy Gradient Descent (HPGD). It establishes non-asymptotic convergence for HPGD and demonstrates its effectiveness through experiments.

Strengths:

  • The framework of bilevel contextual MDPs (BO-CMDPs) is a significant and novel contribution with broad practical applications.
  • The proposed HPGD algorithm is theoretically sound, with proven convergence and practical efficiency.
  • The paper is well-written and clearly presented, making complex concepts accessible.

Weaknesses:

  • The experiments are limited to a simple toy example; more diverse and large-scale applications are suggested.
  • The paper’s theoretical analysis and results are somewhat constrained by reliance on closed-form solutions provided by entropy-based RL.