PaperHub
Overall rating: 6.8/10 · Poster · 4 reviewers
Individual ratings: 4, 4, 4, 5 (min 4, max 5, std 0.4)
Confidence: 3.0 · Novelty: 2.8 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.5
NeurIPS 2025

On the Sample Complexity Bounds of Bilevel Reinforcement Learning

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

Provides the first sample complexity bounds for a first-order bilevel algorithm.

Abstract

Bilevel reinforcement learning (BRL) has emerged as a powerful framework for aligning generative models, yet its theoretical foundations, especially sample complexity bounds, remain underexplored. In this work, we present the first sample complexity bound for BRL, establishing a rate of $\mathcal{O}(\epsilon^{-3})$ in continuous state-action spaces. Traditional MDP analysis techniques do not extend to BRL due to its nested structure and non-convex lower-level problems. We overcome these challenges by leveraging the Polyak-Łojasiewicz (PL) condition and the MDP structure to obtain closed-form gradients, enabling tight sample complexity analysis. Our analysis also extends to general bi-level optimization settings with non-convex lower levels, where we achieve state-of-the-art sample complexity results of $\mathcal{O}(\epsilon^{-3})$ improving upon existing bounds of $\mathcal{O}(\epsilon^{-6})$. Additionally, we address the computational bottleneck of hypergradient estimation by proposing a fully first-order, Hessian-free algorithm suitable for large-scale problems.
Keywords
Bilevel Reinforcement Learning · Sample Complexity · Gradient Descent · Bilevel Optimization

Reviews and Discussion

Review (Rating: 4)

This paper investigates bilevel RL from a theoretical lens. Their algorithm uses penalty-based methods, and they prove that it admits a sample complexity of $\mathcal{O}(\epsilon^{-3})$.

Strengths and Weaknesses

Strengths:

They provide the first sample complexity bound for bilevel RL.

They improve the sample complexity for general bilevel optimization from $\epsilon^{-6}$ to $\epsilon^{-3}$.

Weaknesses:

They do not prove a lower bound to show that their upper bound is tight.

What are some examples of $G(\theta, \lambda)$? While Assumption 3 is present in [1], it seems this is a strong assumption and not reasonable in RL. Lines 133-134 state ``However, such assumptions (unbiased estimates of the upper and lower level objectives) do not hold in BRL, where gradient estimates are inherently biased due to the underlying MDP dynamics.'', which seems to contradict Assumption 3.

Questions

What is the purpose of the KL term in the reward function? This seems like a very contrived way of tying BRL to RLHF. Do the results still hold without the KL regularization term?

Can you prove a lower bound in this setting?

I think these may be typos, but disregard if they are not:

  • Page 9 line 296, should it be $\epsilon^{-3}$ instead of $\epsilon^{-4}$?
  • It seems some of the references are duplicated, e.g. 1 and 2, 21 and 22, etc.

Limitations

The limitations are discussed in the paper.

Justification of Final Rating

My recommended score would be a 3.5, but this is not an option. While they do improve the dependence on $\epsilon$, they do not give a good justification of why the bound is good, i.e., no lower bound. Even though related works do not provide lower bounds either, the authors do not give intuition or an explanation of the difficulties of improving their bound.

Formatting Concerns

None

Author Response

Weakness 1 They do not prove a lower bound to show that their upper bound is tight.

Response to Weakness 1: Thank you for noting this point. Establishing lower bounds for bilevel problems with non-convex lower levels remains open; to our knowledge, the only sample-complexity lower bound currently available is for the convex-lower-level setting (Ji et al., 2024). In the non-convex regime we study, no general lower-bound technique has yet been developed, and recent surveys list it as an active research challenge. Consequently, none of the works we benchmark against, including Chakraborty et al. (2024), Shen et al. (2024), and Yang et al. (2025), provide lower bounds either.

While a formal tightness result is therefore out of reach at present, our upper bound improves on prior rates (from $\mathcal{O}(\varepsilon^{-6})$ / $\mathcal{O}(\varepsilon^{-7})$ to $\mathcal{O}(\varepsilon^{-3})$) and, we hope, will help guide future efforts toward proving lower bounds in the non-convex bilevel setting. We will add this context to the paper to clarify why a lower-bound analysis is not yet available and to highlight it as an important direction for future work.


Weakness 2 What are some examples of $G(\phi,\lambda)$?

Response to Weakness 2: An example of $G(\phi,\lambda)$ is the preference function defined as

$$G(\phi,\lambda) = \mathbb{E}_{(\sigma_1,\sigma_2) \sim \pi(\lambda),\, y}\big[\, y \cdot P^{\phi}(\sigma_1 > \sigma_2) + (1-y) \cdot P^{\phi}(\sigma_1 < \sigma_2) \,\big]$$

Here $P^{\phi}(\sigma_{1} \ge \sigma_{2})$ is defined as

$$P^{\phi}(\sigma_{1} \ge \sigma_{2}) = \frac{\exp\big(Q(\sigma^1)\big)}{\exp\big(Q(\sigma^1)\big) + \exp\big(Q(\sigma^2)\big)}$$

where $Q(\sigma^k) = \sum_{i=1}^{H} r^{\phi}(s^k_i, a^k_i)$ is the cumulative reward of the trajectory $\sigma^{k} = (s^k_i, a^k_i)_{i \in \{1,\dots,H\}}$.

Here $G(\phi,\lambda)$ is a loss function that optimizes the reward function $r^{\phi}$ so that it aligns with human preferences. The inputs to $G(\phi,\lambda)$ are tuples of the form $(y,\sigma_{1},\sigma_{2})$, where $\sigma_{1},\sigma_{2}$ are two trajectories generated by a policy $\pi(\lambda)$ and $y$ is a label that equals 1 if trajectory $\sigma_{1}$ is preferred over $\sigma_{2}$ by a human and 0 otherwise.
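A minimal code sketch of this preference objective, assuming the trajectory returns $Q(\sigma^1)$ and $Q(\sigma^2)$ have already been computed under the current reward parameters $\phi$ (the function name, batching, and use of PyTorch are illustrative, not taken from the paper):

```python
import torch

def preference_objective(returns_1, returns_2, labels):
    """Preference objective G(phi, lambda) in the form written above.

    returns_1, returns_2: cumulative rewards Q(sigma^1), Q(sigma^2) of the two
        trajectories under the current reward parameters phi (shape: [batch]).
    labels: 0/1 human preferences y (1 means sigma^1 is preferred).
    """
    # P^phi(sigma^1 > sigma^2) = exp(Q1) / (exp(Q1) + exp(Q2)) = sigmoid(Q1 - Q2)
    p12 = torch.sigmoid(returns_1 - returns_2)
    # Expected probability of matching the human label; in practice one would
    # typically maximize this, or minimize its negative log-likelihood.
    return (labels * p12 + (1 - labels) * (1 - p12)).mean()
```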

We have used this $G(\phi,\lambda)$ loss function in our empirical demonstration in Appendix F of the Supplementary Material. We implement our algorithm for two simulated environments, namely Walker (Deep Mind Control Suite (2013)) and Door Open (Meta World Control Suite (2025)). In both these environments, we can query the system for tuples $(y,\sigma_{1},\sigma_{2})$ to calculate the gradient of the upper loss function. We evaluate our algorithm against PEBBLE (Lee et al. 2021), which is a state-of-the-art RLHF algorithm, and show a performance improvement.


Weakness 3: While Assumption 3 is present in [1], it seems this is a strong assumption and not reasonable in RL. Lines 133-134 state ``However, such assumptions (unbiased estimates of the upper and lower level objectives) do not hold in BRL, where gradient estimates are inherently biased due to the underlying MDP dynamics.'', which seems to contradict Assumption 3.

Response to Weakness 3: Thank you for pointing out this ambiguity. We intended to distinguish between the two gradients in the bilevel objective:

  • Upper-level gradient ($\nabla_\phi G(\phi,\lambda)$): can be estimated unbiasedly from sampled trajectories.
  • Lower-level gradient ($\nabla_\lambda J(\phi,\lambda)$): generally biased in practical RL, because the critic $Q_{\lambda}(s,a)$ is only an approximation of the true $Q^{\pi}$ (see the schematic below).
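Schematically (with $b$ denoting the bias term, a shorthand used here only for illustration):

$$\mathbb{E}\big[\widehat{\nabla}_\phi G(\phi,\lambda)\big] = \nabla_\phi G(\phi,\lambda), \qquad \mathbb{E}\big[\widehat{\nabla}_\lambda J(\phi,\lambda)\big] = \nabla_\lambda J(\phi,\lambda) + b, \quad b \neq 0 \text{ in general.}$$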

Assumption 3 therefore applies only to the upper objective. We will revise the text on lines 133–134 to read:

The assumption of unbiased upper-level gradients holds, but for the lower level in BRL, the gradient estimator is inherently biased due to function-approximation error in the critic.

Equation (12) in the manuscript already shows how this bias enters the analysis; we will cross-reference it explicitly and add citations to standard results on biased critic gradients in actor–critic methods.

We appreciate the reviewer’s careful reading and will ensure the final version makes this distinction clear.


Question 1 What is the purpose of the KL term in the reward function? This seems like a very contrived way of tying BRL to RLHF. Do the results still hold without the KL regularization term?

Response to Question 1: The KL-divergence term serves two complementary roles:

  • Practical regularization in RLHF. In reinforcement learning from human feedback (RLHF), it is customary to penalize deviation from a reference policy via a KL term (e.g., PPO-RLHF, DPO). This stabilizes training, limits reward-hacking, and keeps the updated policy close to the supervised baseline. We include the same term so that our theoretical analysis matches the algorithms widely used in practice.
  • Generality of the theory. Treating the KL weight $\beta$ as a parameter lets us cover both the RLHF regime ($\beta>0$) and pure reinforcement learning ($\beta=0$) within a single framework.

Do the guarantees depend on $\beta>0$? No. All convergence and sample-complexity results remain valid when $\beta=0$. Setting $\beta=0$ simply drops the regularizer from the reward; the proofs require no additional modifications, because the KL term is handled as an additive, Lipschitz-smooth component that can be zeroed out. We will add a remark to the paper to state this explicitly and to clarify that the KL term is optional rather than essential.
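For concreteness, a typical form of such a KL-regularized lower-level objective is (illustrative only; $\pi_{\mathrm{ref}}$ denotes a reference policy, and the exact expression in the paper may differ):

$$J_{\beta}(\phi,\lambda) = \mathbb{E}_{\pi_{\lambda}}\Big[\sum_{t \ge 0} \gamma^{t}\, r^{\phi}(s_t,a_t)\Big] - \beta\, \mathbb{E}_{s}\big[\mathrm{KL}\big(\pi_{\lambda}(\cdot\mid s)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid s)\big)\big], \qquad \beta \ge 0,$$

so that $\beta = 0$ recovers the unregularized return.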


Question 2 Can you prove a lower bound in this setting?

Response to Question 2: A lower bound for bilevel optimization with a non-convex lower level is still an open question. We will mention this in the future-work directions.


Typos: Thank you for pointing out the typos. We will fix them in the final version.


References

Kaiyi Ji, Yingbin Liang. "Lower Bounds and Accelerated Algorithms for Bilevel Optimization". JMLR, 2022

Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Dinesh Manocha, Huazheng Wang, Mengdi Wang, Furong Huang. "PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback". ICLR 2024

Han Shen, Zhuoran Yang, Tianyi Chen. "Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF", ICML 2024.

Yan Yang, Bin Gao, Ya-xiang Yuan "Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexity". AISTATS 2025

Kimin Lee, Laura Smith, Pieter Abbeel. "PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training". ICML 2021

Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, Martin Riedmiller. "Deep Mind Control Suite". arXiv, 2018

Reginald McLean, Evangelos Chatzaroulas, Luc McCutcheon, Frank Röder, Tianhe Yu, Zhanpeng He, K.R. Zentner, Ryan Julian, J K Terry, Isaac Woungang, Nariman Farsad, Pablo Samuel Castro. "Meta-World+: An Improved, Standardized, RL Benchmark". arXiv, 2025

Comment

Thank you for your response to my questions. While it may be true that there is no current lower bound for your setting, do you think that your result is tight? In other words, is there a compelling/intuitive reason as to why $\epsilon^{-3}$ is optimal instead of a current technical limitation?

Comment
  • Thank you for your insightful question. While we do not currently know a formal lower bound for bilevel optimization with a non-convex upper-level objective and a PL-satisfying non-convex lower level, we believe our result of $O(\epsilon^{-3})$ is a meaningful and likely near-optimal benchmark.
  • In particular, our algorithm requires $O(1/\epsilon)$ outer iterations to drive the upper-level optimality gap to $\epsilon$, $O(\log(1/\epsilon))$ inner iterations to approximate the lower-level solution under the PL condition, and $O(1/\epsilon^2)$ samples per gradient estimate to ensure accuracy in the presence of stochasticity. These components compound multiplicatively to yield a total sample complexity of $\tilde{O}(\epsilon^{-3})$, as sketched after this list.
  • We note that bilevel problems are strictly more challenging than single-level ones, where $O(\epsilon^{-2})$ is often achievable under PL conditions. This further suggests that our $O(\epsilon^{-3})$ bound is a reasonable first result for this class. However, due to the absence of known information-theoretic lower bounds for the class of problems with non-convex upper-level and non-convex (PL) lower-level objectives, we cannot make definitive claims about tightness. Establishing such lower bounds remains an open and important direction for future research.
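Schematically, ignoring constants and logarithmic factors, these counts compose as:

$$\underbrace{O(1/\epsilon)}_{\text{outer iterations}} \times \underbrace{O(\log(1/\epsilon))}_{\text{inner iterations}} \times \underbrace{O(1/\epsilon^{2})}_{\text{samples per gradient}} = \tilde{O}(\epsilon^{-3}).$$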
Comment

Dear Reviewer XAuT,

We sincerely thank you again for your valuable feedback, which has greatly helped us improve the quality of our work. As the discussion deadline is approaching, we kindly request you to review our rebuttal. We are happy to discuss more if you have any further concerns. Thank you again for your time and effort!

Regards,

Authors

Review (Rating: 4)

This paper presents the first finite-sample analysis of BRL, establishing a sample complexity of $\mathcal{O}(\epsilon^{-3})$ in continuous state-action spaces. The contribution is both clear and sound.

Strengths and Weaknesses

Strengths

  • This paper is well-written and easy to follow.
  • The contribution is clear and significant.
  • Establishing sample complexity bounds for continuous state-action spaces is an important advancement.

Weaknesses

  • Assumption 1.3 requires Lipschitz and smooth Hessians, which is a strong condition.
  • While the authors emphasize the contribution of extending from the tabular setting to continuous spaces, it remains unclear what specific technique was developed to enable this extension.
  • There are a few typos in the paper. For example, in the conclusion, the sample complexity should be stated as $\mathcal{O}(\epsilon^{-3})$ rather than $\mathcal{O}(\epsilon^{-4})$.

Questions

My primary concern pertains to the technical contribution of this work.

Could you elaborate further on Assumption 1.3? Why is it necessary, and has it appeared in prior work? If not, please clarify how this new assumption relates to existing assumptions in the literature.

Furthermore, what limitations in prior work prevented the establishment of sample complexity bounds for continuous state-action spaces? What specific techniques were developed in this paper to overcome these challenges and enable the extension?

Limitations

This work requires strong assumptions.

Justification of Final Rating

All my questions have been answered, and I'm convinced of the contribution of this work.

Formatting Concerns

None.

Author Response

Weakness 1/Question 1:- Assumption 1.3 requires Lipschitz and smooth Hessians, which is a strong condition. Could you elaborate further on Assumption 1.3? Why is it necessary, and has it appeared in prior work? If not, please clarify how this new assumption relates to existing assumptions in the literature.

Response to Weakness 1/Question 1: Thank you for the thoughtful question and the opportunity to clarify Assumption 1.3.

The confusion around Assumption 1.3 appears to stem from a wording oversight in our current draft. We had inadvertently described the Hessians as “smooth,” whereas our intended assumption is that the Hessians of $G(\phi, \lambda)$ (with respect to $\lambda$) and $J(\phi, \lambda)$ (with respect to both $\phi$ and $\lambda$) are Lipschitz continuous. We will revise the assumption statement accordingly (mentioned below for quick reference) in the final version to ensure clarity.

  • Assumption 1.3: The function $G(\phi,\lambda)$ has Lipschitz Hessians with respect to $\lambda$. The function $J(\phi,\lambda)$ has Lipschitz Hessians with respect to both $\lambda$ and $\phi$.
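In symbols, the condition on $G$ reads (an illustrative statement with a generic Lipschitz constant $L_H$; the paper's exact constants may differ):

$$\big\|\nabla^{2}_{\lambda} G(\phi,\lambda) - \nabla^{2}_{\lambda} G(\phi,\lambda')\big\| \le L_{H}\,\|\lambda - \lambda'\| \quad \text{for all } \lambda, \lambda',$$

and analogously for the Hessians of $J(\phi,\lambda)$ with respect to both $\lambda$ and $\phi$.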

This Lipschitz condition on the Hessians is well established in the bilevel optimization literature. In particular, Chen et al. (2024) (see Assumption 4.1) show that such assumptions are necessary for convergence in the bilevel setting introduced in Kwon et al. (2024), which deals with non-convex lower-level problems. Additionally, similar assumptions appear in bilevel works such as Ji et al. (2021) (see Assumption 3) and Yang et al. (2023) (see Assumption 2), which assume convex lower-level problems. They are also assumed in the bilevel RL literature, such as Chakraborty et al. (2024) (Assumption 3).


Weakness 2/Question 2:- While the authors emphasize the contribution of extending from the tabular setting to continuous spaces, it remains unclear what specific technique was developed to enable this extension. Furthermore, what limitations in prior work prevented the establishment of sample complexity bounds for continuous state-action spaces? What specific techniques were developed in this paper to overcome these challenges and enable the extension?

Response to Weakness 2/Question 2:

Limitations of prior works - and how we improve

| Reference | Assumed setting | Blocking limitation |
| --- | --- | --- |
| Chakraborty et al. (2024): hypergradient w/ SoftMax policy | Discrete action (tabular or SoftMax) | Can be extended to continuous state-action spaces but involves computing 2nd-order terms (impractical for real-world usage) |
| Yang et al. (2025): closed-form hypergradient | Tabular | Closed-form hypergradient expressions assume a tabular setup (see Equations 5-8); cannot be extended to a continuous setup |
| Shen et al. (2024): penalty-based bilevel | Tabular | Strongly convex penalty is defined for the tabular setup (see Section 3.2); strong convexity of the penalty term is required for convergence; cannot be extended to a continuous setup |
| All of the above | | Provide iteration complexity only; no sample-efficiency guarantees |

Our contributions:

  • Continuous state–action space,
  • Using first-order information only: no Hessian or hyper-Hessian inversion needed,
  • Sample complexity bounds: tight bound of $\mathcal{O}(\epsilon^{-3})$ (see Theorem 1).

Our Key technical ideas:

  1. Bias-aware decomposition of gradient estimation error. We use the bilevel optimization framework of Kwon et al. (2024), since it provides first-order gradient expressions without being limited to tabular settings and allows for a non-convex lower loss. However, in bilevel RL, we have to use a biased estimate of $\nabla_{\lambda}J(\phi,\lambda)$; this bias is a key property of bilevel RL that needs to be accounted for in order to perform convergence analysis. In contrast, the convergence analysis of standard bilevel works such as Kwon et al. (2024) assumes this bias to be zero (see Assumption 11 in Kwon et al. (2024)). In Section 5.1, "Gradient Estimation Error" introduces a novel method for separating estimation bias from stochastic noise in the gradient estimation error for the proxy objective defined in Kwon et al. (2024). This novel decomposition allows convergence analysis within the framework of Kwon et al. (2024) in the presence of biased gradient estimates (a schematic form is sketched after this list).

  2. Lemma 1: specifically upper bounds the component of the gradient estimation error that arises due to the estimation bias for $\nabla_{\lambda}J(\phi,\lambda)$. We obtain sample complexity bounds of $\mathcal{O}(\epsilon^{-3})$ within the framework of Kwon et al. (2024), where previous bounds were $\mathcal{O}(\epsilon^{-7})$ (Kwon et al., 2024) and $\mathcal{O}(\epsilon^{-6})$ (Chen et al., 2024). Lemma 1 allows us to obtain these improved bounds in the presence of biased estimates for $\nabla_{\lambda}J(\phi,\lambda)$.

  3. Unified first-order analysis. We thus obtain the first provable sample complexity for bilevel RL without tabular or second-order assumptions.
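Schematically, the decomposition in item 1 takes the form (constants omitted; the precise terms are given in Section 5.1 and Lemma 1):

$$\big\|\widehat{\nabla}\Phi_{\sigma}(\phi) - \nabla\Phi_{\sigma}(\phi)\big\| \lesssim \underbrace{\text{bias from the approximate estimate of } \nabla_{\lambda}J(\phi,\lambda)}_{\text{controlled via Lemma 1 and } \epsilon_{error}} + \underbrace{\text{stochastic sampling noise}}_{O(1/\sqrt{B})}.$$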

Empirical Demonstration:

In Appendix F of the Supplementary Material, we have implemented our algorithm in two RLHF (Reinforcement Learning from Human Feedback) simulations, namely Walker (Deep Mind Control Suite (2013)) and Door Open (Meta World+ (2025)). We evaluate our algorithm against PEBBLE (Lee et al. 2021), which is a state-of-the-art RLHF algorithm, and show performance improvement.


Weakness 3 - There are a few typos in the paper. For example, in the conclusion, the sample complexity should be stated as $\mathcal{O}(\epsilon^{-3})$ rather than $\mathcal{O}(\epsilon^{-4})$.

Response to Weakness 3:

Thanks a lot, we will revise the typos in the final version, including the one mentioned.


References

  • Lesi Chen, Jing Xu, Jingzhao Zhang. "On Finding Small Hyper-Gradients in Bilevel Optimization: Hardness Results and Improved Analysis". COLT 2024.

  • Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, Robert Nowak. "On Penalty Methods for Nonconvex Bilevel Optimization and First-Order Stochastic Approximation". ICLR 2024

  • Kaiyi Ji, Junjie Yang, Yingbin Liang. "Bilevel Optimization: Convergence Analysis and Enhanced Design". ICML 2021.

  • Yifan Yang, Peiyao Xiao, Kaiyi Ji. "Achieving $\mathcal{O}(\epsilon^{-1.5})$ Complexity in Hessian/Jacobian-free Stochastic Bilevel Optimization". NeurIPS 2023

  • Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Dinesh Manocha, Huazheng Wang, Mengdi Wang, Furong Huang. "PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback". ICLR 2024

  • Yan Yang, Bin Gao, Ya-xiang Yuan. "Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexity". AISTATS 2025

  • Han Shen, Zhuoran Yang, Tianyi Chen. "Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF", ICML 2024.

  • Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, Martin Riedmiller. "Deep Mind Control Suite". arXiv, 2018

  • Reginald McLean, Evangelos Chatzaroulas, Luc McCutcheon, Frank Röder, Tianhe Yu, Zhanpeng He, K.R. Zentner, Ryan Julian, J K Terry, Isaac Woungang, Nariman Farsad, Pablo Samuel Castro. "Meta-World+: An Improved, Standardized, RL Benchmark". arXiv, 2025

  • Kimin Lee, Laura Smith, Pieter Abbeel. "PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training". ICML 2021

Comment

Dear Reviewer m28U,

We sincerely thank you again for your valuable feedback, which has greatly helped us improve the quality of our work. As the discussion deadline is approaching, we kindly request you to review our rebuttal. We are happy to discuss more if you have any further concerns. If our responses adequately address your concerns, we kindly ask you to consider adjusting the corresponding scores. Thank you again for your time and effort!

Regards,

Authors

Comment

Thanks for your response and reminder. The reply has satisfactorily addressed my questions, and I'm convinced of the technical contribution of this work. Therefore, I am raising my score.

Comment

Dear Reviewer m28U,

Thank you for your valuable feedback and rebuttal acknowledgement. We appreciate your recognizing our contributions and raising the scores.

Regards,

Authors

Review (Rating: 4)

This paper studies the sample complexity of bilevel RL. In bilevel RL, the goal is to minimize an objective function $G(\phi, \lambda^*)$, where $\phi$ is the parameter of the reward model and $\lambda^*$ is the parameter of the optimal policy induced by the reward model represented by $\phi$. Prior work on this problem requires (approximately) evaluating Hessian/Jacobian matrices and Hessian/Jacobian vector products, or focuses on the tabular setting. In this paper, the authors propose a new first-order algorithm for bilevel RL that works for continuous state-action spaces, and provide sample complexity results for their algorithm.

Strengths and Weaknesses

Strength: This paper gives the first sample complexity result for bilevel RL. Moreover, the algorithm can handle continuous spaces. In addition, even for the standard bilevel optimization setup, the proposed algorithm yields improved sample complexity.

Weaknesses:

  1. I would suggest adding a more detailed introduction on bilevel RL. In particular, I would suggest adding more discussion on the connection between bilevel RL and other topics in RL (like inverse RL and RLHF). Currently, bilevel RL is defined in a very abstract manner, and it would be hard for readers (especially those who haven't worked on bilevel RL) to understand its connection to other RL problems.
  2. A more detailed comparison with [17], especially in terms of technical aspects, should be added. In particular, it is clear that the result in [17] applies only to bilevel optimization problems without an MDP structure, whereas for bilevel RL, gradient estimates are inherently biased. However, it is unclear to me why the analysis in the paper, when specialized to the standard bilevel optimization setup, yields improved sample complexity bounds. What are the new technical ideas that allow such an improvement?

Questions

See Weaknesses above.

Limitations

Yes.

Justification of Final Rating

The authors' response resolved my concerns, and therefore I would like to keep my positive score.

Formatting Concerns

No such concern.

Author Response

Weakness 1:- I would suggest adding a more detailed introduction on bilevel RL. In particular, I would suggest adding more discussion on the connection between bilevel RL and other topics in RL (like inverse RL and RLHF). Currently, bilevel RL is defined in a very abstract manner, and it would be hard for readers (especially those who haven't worked on bilevel RL) to understand its connection to other RL problems.

Response to Weakness 1: Thank you for this helpful suggestion. We agree that a clearer exposition of bilevel RL and its relationship to other RL frameworks, such as inverse RL and RLHF, would improve accessibility for a broader audience. In the final version, we will expand the introduction to better contextualize bilevel RL. Specifically, we will highlight how it differs from standard RL (where the reward is typically known or fixed a priori) and clarify its relevance in settings where the reward function must be inferred or adapted through nested optimization.

We give a brief overview of the differences between them here.

Standard RL: In a standard RL setup, an agent has to learn a policy that maximizes the cumulative reward over a finite or infinite horizon. In such a setup, it is assumed that the reward function is known to the agent and fixed. This includes applications such as Atari games (Mnih et al. 2013). The function to be optimized is the expected cumulative reward, denoted by $J(\lambda)$, where $\lambda$ is the policy parameter.

Bi-level RL: In bilevel RL, the reward function is not known. An example is a setting where a robotic agent has to appear humanlike. The reward function in such a case is a parametrized function $r_{\phi}$, which is updated using demonstrations from a human. These demonstrations are used to update the reward parameter by minimizing a loss function $G(\phi,\lambda)$, known as the upper loss function. The learned reward parameter then defines the expected cumulative reward $J(\phi,\lambda)$, which is maximized over the policy. Thus, we have two loss functions to optimize, and this problem is cast as a bilevel optimization.
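In symbols, this yields the bilevel problem (a compact restatement of the two objectives described above; $\lambda^{*}(\phi)$ denotes the policy parameter induced by the reward parameter $\phi$):

$$\min_{\phi}\; G\big(\phi, \lambda^{*}(\phi)\big) \quad \text{s.t.} \quad \lambda^{*}(\phi) \in \arg\max_{\lambda}\; J(\phi,\lambda).$$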

For example, in Appendix F of the Supplementary Material, we implement our algorithm for two simulated environments, namely Walker (Deep Mind Control Suite (2013)) and Door Open (Meta World+ (2025)). In both these environments, the true reward function is unknown to us. We can only query a pair of trajectories from the environment and a label that indicates which of the two trajectories is preferred. Using this data, we learn a reward function that is then used to learn a policy that maximizes the cumulative reward. This process is repeated sequentially (as we have laid out in Algorithm 1). We evaluate our algorithm against PEBBLE (Lee et al. 2021), which is a state-of-the-art RLHF algorithm, and show an improvement in performance.


Weakness 2:- A more detailed comparison with [17], especially in terms of technical aspects, should be added. In particular, it is clear that the result in [17] applies only to bilevel optimization problems without an MDP structure, where for bilevel RL, gradient estimates are inherently biased. However, it is unclear to me why the analysis in the paper, when specialized to the standard bilevel optimization setup, yields improved sample complexity bounds. What are the new technical ideas that allow such an improvement?

Response to Weakness 2:

Technical ideas that allow improvement. The key idea behind the improved sample complexity in the bilevel RL setup is the decomposition of the gradient estimation error laid out in Section 5.1, in the part titled Gradient Estimation Error. This decomposition is novel and allows us to obtain improved sample complexity. For the bilevel RL case, we also need Lemma 1 and Assumption 2 to account for the additional bias incurred in estimating the gradient $\nabla_{\lambda}J(\phi,\lambda)$. In the standard bilevel case, Lemma 1 and Assumption 2 are not needed. This is because it is a common assumption that we have access to unbiased, bounded-variance estimates of the gradients of both the upper and lower loss functions. This stronger assumption is justified in cases such as hyperparameter optimization (Franceschi et al. 2017). This assumption is also present in bilevel optimization analyses such as Kwon et al. (2020) (see Assumption 3) and Chen et al. (2024) (see Assumption 4.2).

The standard bilevel case is thus a simplified case of the bilevel RL setup and has the same sample complexity result. We also note that we have incorrectly stated in the statement of Theorem 2 that Assumption 2 is needed. That is a typo, and we will correct it in the final version. Theorem 2 only needs Assumptions 1 and 4.


References

  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller. "Playing Atari with Deep Reinforcement Learning". NeurIPS 2013.

  • Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, Martin Riedmiller. "Deep Mind Control Suite". arXiv, 2018

  • Reginald McLean, Evangelos Chatzaroulas, Luc McCutcheon, Frank Röder, Tianhe Yu, Zhanpeng He, K.R. Zentner, Ryan Julian, J K Terry, Isaac Woungang, Nariman Farsad, Pablo Samuel Castro. "Meta-World+: An Improved, Standardized, RL Benchmark". arXiv, 2025

  • Kimin Lee, Laura Smith, Pieter Abbeel. "PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training". ICML 2021

  • Luca Franceschi, Michele Donini, Paolo Frasconi, Massimiliano Pontil. "Forward and Reverse Gradient-Based Hyperparameter Optimization". ICML 2017.

  • Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, Robert Nowak. "A Fully First-Order Method for Stochastic Bilevel Optimization". ICML 2020.

  • Lesi Chen, Jing Xu, Jingzhao Zhang. "On Finding Small Hyper-Gradients in Bilevel Optimization: Hardness Results and Improved Analysis". COLT 2024.

Comment

Dear Reviewer aCd4,

We sincerely thank you again for your valuable feedback, which has greatly helped us improve the quality of our work. As the discussion deadline is approaching, we kindly request you to review our rebuttal. We are happy to discuss more if you have any further concerns. Thank you again for your time and effort!

Regards,

Authors

Comment

Thank you for your response. I have no further questions and decide to keep my score.

Review (Rating: 5)

This paper studies a first-order, Hessian-free algorithm for bilevel reinforcement learning. The authors consider two settings, with and without access to unbiased gradients, and for both settings they establish an $\epsilon^{-4}$ sample complexity. All analysis relies on a PL condition for the lower-level objective and Lipschitz smoothness of gradients and Hessians.

Strengths and Weaknesses

Strength:

  • The authors claim to have improved the sample complexity of bilevel optimization in the case of a non-convex lower level and without second-order operations, from the already known $\epsilon^{-6}$ to $\epsilon^{-3}$.

Weakness:

  • The convergence results are only to the stationary point. There are no discussions on what is necessary to guarantee global convergence of this algorithm.
  • The analysis is only for fixed values of $\sigma$ and the step size $\eta$, as well as $B$, $N$, and $T$. Hence, the algorithm does not have asymptotic convergence.
  • The final result is only in terms of $\epsilon$, and constants that depend on the problem structure are missing from the statement of the theorem. These constants can be especially important when studying settings like RLHF, where the underlying structure (state, action, and $1-\gamma$ dependency) is important.
  • Satisfying the PL condition for an RL problem is not necessarily true. Can you justify this condition for the RL setting? Can you provide concrete function approximation settings where all the assumptions are satisfied?
  • There are no implementations justifying the sample complexity that is achieved in the theory.

问题

  • What are $s_i$, $a_i$ in Eq. (1)?
  • In both Theorems 1 and 2 you assume Assumption 2, which specifies an upper bound on the representation power of the neural network that models the Q-function. It is not clear to me why the approximation error appears in Theorem 1 but not Theorem 2. Can you elaborate?
  • What are the implications of your results for RLHF theory? What is the query complexity of RLHF when studied under the umbrella of bilevel optimization? How is the query complexity of RLHF affected by the structure of the underlying MDP?
  • In the conclusion section the authors suggest an $\epsilon^{-4}$ sample complexity, which is in contrast to the rest of the paper. Can you please clarify?
  • Are there any known lower bounds for bilevel optimization?

Limitations

Yes

Justification of Final Rating

The authors have addressed my concerns, and I believe this is a strong paper which deserves a higher score.

Formatting Concerns

N/A

Author Response

Weakness 1:- The convergence results are only to the stationary point. There are no discussions on what is necessary to guarantee global convergence of this algorithm.

Response to Weakness 1: Thank you for your observation.

  • We did not discuss global optimality because our upper-level objective is non-convex, and in such settings, convergence to a stationary point is the standard and, in general, the best attainable guarantee. This is consistent with existing results in both bilevel reinforcement learning (Chakraborty et al. 2024) and general bilevel optimization literature (Chen et al. 2024)
  • Global convergence: While one could obtain global convergence by imposing stronger conditions such as the Polyak–Łojasiewicz (PL) condition on the upper loss function, we chose not to include such assumptions, as they would lead to a more restricted and less realistic setting and would make our results not comparable to other works.

Weakness 2: The analysis is only for fixed values of $\sigma$ and the step size $\eta$, as well as $B$, $N$, and $T$. Hence, the algorithm does not have asymptotic convergence.

Response to Weakness 2: Thank you for pointing this out. Our analysis focuses on finite-time convergence, explicitly characterizing how the rate depends on parameters such as the batch size $B$, the number of samples $N$, and the total number of steps $T$. This provides a more practical and informative guarantee than asymptotic convergence, which does not offer explicit rates. While we use a fixed step size in our analysis, we validate its effectiveness empirically through the results presented in the paper.

Weakness 3: The final result is only in terms of $\epsilon$, and ...... when studying settings like RLHF, where the underlying structure (state, action, and $1-\gamma$ dependency) is important.

Response to Weakness 3: It is correct that our convergence result is in big-O notation and does not specifically account for every possible constant that can affect the convergence. However, we would like to point out that we are focused on the sample complexity, i.e., how many samples are needed to obtain convergence to a given order of magnitude $\epsilon$. This is standard in bilevel RL works such as Shen et al. (2024) (see Table 1) and Yang et al. (2025) (see Table 1), which report their convergence rates in big-O notation. Works such as Chakraborty et al. (2024) report their rates in terms of problem-dependent constants (see constants $c_1$ and $L_{g}$ in Theorem 1). Thus, we are following standard practice when we report our results in big-O notation.

Weakness 4: Satisfying PL condition for RL problem .... where all the assumptions are satisfied?

Response to Weakness 4: The PL (gradient‑dominance) assumption is not imposed arbitrarily; it is known to hold in several reinforcement‑learning settings.

  • For example, Bhandari et al. (2021) show that if the one‑step policy‑improvement objective is convex, then the full policy‑gradient objective is gradient dominated.
  • In continuous control, the LQR cost is proven to satisfy a PL inequality (Fazel et al. 2018).
  • Mei et al. (2022) demonstrate that SoftMax policy-gradient objectives satisfy a Łojasiewicz (PL-like) inequality, and entropy regularization turns it into a full PL condition, leading to linear convergence.
  • Wang et al. (2022) established that the robust RL objective under R-contamination uncertainty satisfies the PL condition, and Montenegro et al. (2025) show that regularized Lagrangians in constrained RL are PL when using natural or SoftMax parameterizations.

Because our bilevel loss includes a regularization term similar to these works, it enjoys the same gradient‑dominance property. Moreover, in tabular, SoftMax, or linear‑quadratic settings, the PL condition holds exactly, and our analysis covers these function‑approximation cases.
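For reference, the PL (gradient-dominance) inequality for the lower-level objective takes the form (standard definition stated here for convenience; $\mu > 0$ is a generic PL constant, not a symbol from the paper):

$$\tfrac{1}{2}\,\big\|\nabla_{\lambda} J(\phi,\lambda)\big\|^{2} \ge \mu\,\Big(\max_{\lambda'} J(\phi,\lambda') - J(\phi,\lambda)\Big) \quad \text{for all } \lambda.$$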

We provide additional justification for the use of the PL condition by implementing the algorithm in Appendix F of the Supplementary Material.

Weakness 5: There are no implementations justifying the sample complexity that is achieved in the theory.

Response to Weakness 5: In Appendix F of the Supplementary Material, we have implemented our algorithm in two RLHF (Reinforcement Learning from Human Feedback) simulations, namely Walker (Deep Mind Control Suite (2013)) and Door Open (Meta World+ (2025)). These are RLHF benchmarks that have been used to demonstrate the effectiveness of prior bilevel RL works such as Chakraborty et al. (2024). We evaluate our algorithm against PEBBLE (Lee et al. 2021), which is a state-of-the-art RLHF algorithm, and show performance improvement.

Question 1: What are $s_i$, $a_i$ in Eq. (1)?

Response to Question 1: That's a typo; it should just be (s,a). We will correct this in the final version.

Question 2: In both Theorems 1 and 2, you assume ... Can you elaborate?

Response to Question 2: Thank you for pointing this out. For Theorem 2, we do not need Assumption 2. That is a typo in the statement of Theorem 2. We only need assumptions 1 and 4 for Theorem 2, and we will correct this mistake in the final version of the paper. The reason for this is as follows:

  • Theorem 1 tackles the bilevel RL setup, which requires the use of a neural network to approximate the Q function. Thus, we do not have access to an unbiased estimate of the Q function, and we need to account for the bias that this introduces. Assumption 2 thus defines the constant $\epsilon_{error}$, which upper bounds the representation power of the neural network class (a sketch of this role is given after this list). This assumption, along with Lemma 1, is used to account for the biased estimate of the Q function. Consequently, $\epsilon_{error}$ shows up in the final theorem statement.

  • In contrast, Theorem 2 tackles the standard bilevel setup, where it is common practice to assume unbiased and bounded gradient access for both the upper and lower gradients. Such an assumption is used in existing bilevel works, such as Kwon et al. (2020) (see Assumption 3) and Chen et al. (2024) (see Assumption 4.2). Thus, we do not need Assumption 2 or Lemma 1, and hence $\epsilon_{error}$ does not show up in the statement of Theorem 2.
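For concreteness, Assumption 2 can be thought of as a uniform bound on the critic's function-approximation error, e.g. something of the form (illustrative notation; the paper's exact statement may differ):

$$\sup_{s,a}\;\big| Q_{\lambda}(s,a) - Q^{\pi_{\lambda}}(s,a) \big| \le \epsilon_{error},$$

which is why $\epsilon_{error}$ propagates into Theorem 1 (bilevel RL with an approximate critic) but not into Theorem 2 (standard bilevel with unbiased gradient oracles).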

Question 3: What is the implications of your ....underlying MDP?

Response to Question 3:

  • Query complexity of RLHF. In our setup, for an RLHF implementation, the query complexity is given by the batch size $B$, which is the number of samples used at each iteration of the algorithm to estimate the gradient of the proxy objective $\nabla\Phi_{\sigma}(\phi)$. For an RLHF setup, this corresponds to the number of queries to the human labeler. In the experiments we conducted, the queries corresponded to the demonstrations we queried from the Walker and Door Open environments.
  • Query complexity and the structure of the underlying MDP. The query complexity determines how well we estimate the gradient of the proxy objective $\nabla\Phi_{\sigma}(\phi)$, the closed-form expression for which is given in Equation (10). On the right-hand side of Equation (10), note the presence of the term $\lambda^{*}(\phi)$. This term is the policy parameter of the optimal policy for the expected return $J(\phi,\lambda)$, which is closely related to the underlying structure of the MDP. This in turn relates the query complexity to the underlying MDP structure.

Question 4: In the conclusion....Can you please clarify?

Response to Question 4: That is a typo. It should be $\mathcal{O}(\epsilon^{-3})$ in the conclusion. We apologize for the typo and will fix it in the final version.

Question 5: Are there any known lower bounds for bilevel optimization?

Response to Question 5: We are not aware of any when the lower level is non-convex, which is the case in bilevel RL.


References

  • Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Dinesh Manocha, Huazheng Wang, Mengdi Wang, Furong Huang. "PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback". ICLR 2024

  • Lesi Chen, Jing Xu, Jingzhao Zhang. "On Finding Small Hyper-Gradients in Bilevel Optimization: Hardness Results and Improved Analysis". COLT 2024.

  • Han Shen, Zhuoran Yang, Tianyi Chen. "Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF", ICML 2024.

  • Yan Yang, Bin Gao, Ya-xiang Yuan "Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexity". AISTATS 2025

  • Jalaj Bhandari and Daniel Russo. "Global Optimality Guarantees For Policy Gradient Methods". Operations Research 2024.

  • Maryam Fazel, Rong Ge, Sham M. Kakade, Mehran Mesbahi. "Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator". ICML 2018.

  • Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, Dale Schuurmans. "On the Global Convergence Rates of Softmax Policy Gradient Methods". ICML 2020

  • Yue Wang, Shaofeng Zou. "Policy Gradient Method For Robust Reinforcement Learning". ICML 2022

  • Alessandro Montenegro, Marco Mussi, Matteo Papini, Alberto Maria Metelli. "Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity". ICML 2025.

  • Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, Martin Riedmiller. "Deep Mind Control Suite". arXiv, 2018

  • Reginald McLean, Evangelos Chatzaroulas, Luc McCutcheon, Frank Röder, Tianhe Yu, Zhanpeng He, K.R. Zentner, Ryan Julian, J K Terry, Isaac Woungang, Nariman Farsad, Pablo Samuel Castro. "Meta-World+: An Improved, Standardized, RL Benchmark". arXiv, 2025

  • Kimin Lee, Laura Smith, Pieter Abbeel. "PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training". ICML 2021

  • Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, Robert Nowak. "A Fully First-Order Method for Stochastic Bilevel Optimization". ICML 2020.

Comment

Dear Reviewer o7DV,

We sincerely thank you again for your valuable feedback, which has greatly helped us improve the quality of our work. As the discussion deadline is approaching, we kindly request that you review our rebuttal. We are happy to discuss more if you have any further concerns. Thank you again for your time and effort!

Regards,

Authors

Comment

Thank you for your response. I have no further questions.

Final Decision

This paper tackles the sample complexity of bilevel reinforcement learning (BRL), a framework relevant to settings such as inverse RL and RLHF. The authors propose a first-order, Hessian-free algorithm and establish what they claim to be the first sample complexity bounds for BRL in continuous state-action spaces. The key enabling technique in their analysis is the decomposition of the gradient estimation error arising in RL. Nevertheless, the capability of handling continuous state-action spaces stems from the adoption of the first-order gradient expression in the existing literature (Kwon et al., 2024). Their results extend to general bilevel optimization settings with non-convex lower levels and improve upon existing sample complexity results. The paper also provides finite-time rates, a bias-aware gradient error decomposition, and an implementation demonstrating empirical gains over a strong RLHF baseline (PEBBLE).

Reviewers generally found the paper technically sound and its contributions meaningful. Strengths include the novelty of providing the first explicit sample complexity bounds for BRL in continuous domains, the practical relevance of the first-order, Hessian-free approach, and a clear theoretical contribution that improves known rates for general bilevel optimization. The rebuttal clarified key points, including the role of the PL condition in RL settings, the precise assumptions (Lipschitz Hessians rather than “smooth” ones), unbiasedness vs. bias in gradient estimators, and the query complexity in RLHF scenarios. Concerns centered on (i) reliance on the PL condition, which may not hold broadly across RL settings; (ii) the absence of matching lower bounds to demonstrate tightness; (iii) some initial lack of clarity in assumptions and connections to prior bilevel RL, bilevel optimization, and RLHF work; and (iv) the need for a clearer presentation of technical novelties, especially in relation to prior analyses. Reviewers also noted that the paper's exposition could better contextualize bilevel RL relative to standard RL, inverse RL, and RLHF. These issues were largely addressed in the rebuttal, with promised revisions to clarify assumptions, correct typos, and expand related work.

Overall, this paper makes a meaningful contribution, with rich results that advance the understanding of bilevel RL.