PaperHub

Rating: 6.0 / 10 (Poster, 4 reviewers; min 5, max 7, std 0.7)
Scores: 5, 6, 7, 6
Confidence: 3.8 · Correctness: 2.5 · Contribution: 2.8 · Presentation: 2.8
NeurIPS 2024

Simplifying Constraint Inference with Inverse Reinforcement Learning

Submitted: 2024-05-15 · Updated: 2025-01-16
TL;DR

Solving a constrained MDP in the forward pass of inverse RL is an unnecessary complication for inferring constraints - constraints can be inferred as well or better through simple IRL methods.

Abstract

Keywords

reinforcement learning, inverse reinforcement learning, safe reinforcement learning, constrained reinforcement learning

Reviews and Discussion

Review (Rating: 5)

This paper proposes a way to reduce the tri-level structure of ICRL to a bi-level one, and uses solid experimental results to validate that this bi-level reformulation achieves better empirical results. The authors also intuitively explain that this is because the tri-level optimization has a more complicated optimization landscape. In general, this paper clearly delivers its idea and is easy to follow. However, the major weakness is that this simplification from tri-level to bi-level is trivial and can be further exploited.

Strengths

This paper provides solid empirical results to evaluate its idea. It also answers a fundamental question for ICRL, i.e., that there is no fundamental difference between IRL and ICRL; in other words, we can use IRL algorithms to solve ICRL problems.

Weaknesses

The main weakness lies in the core contribution, i.e., the simplification of the tri-level formulation to a bi-level one. This simplification comes from the observation that we can always learn a cost function $c'$ that captures both the dual variable $\lambda$ and the original cost function $c$, i.e., $c' = \lambda c$. Therefore, we can learn a single cost function $c'$ to replace $\lambda c$.
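In the notation of the authors' general response further below, this observation amounts to the following identity, assuming the constraint class $\mathcal{F}$ is closed under scaling by nonnegative multipliers (this is exactly the Theorem stated in that response):

$$\max_{c \in \mathcal{F}} \max_{\lambda \geq 0} \min_{\pi \in \Pi} \big[ J(\pi_E, r - \lambda c) - J(\pi, r - \lambda c) \big] \;=\; \max_{c' \in \mathcal{F}} \min_{\pi \in \Pi} \big[ J(\pi_E, r - c') - J(\pi, r - c') \big],$$

where the constraint function on the right-hand side plays the role of $c' = \lambda c$.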

This idea is correct, and I agree that learning a single cost function $c$ is enough. However, this idea is trivial and can be further exploited. In fact, I have seen a similar trick in the ICRL literature [1]. In reference [1], the authors also remove the dual variable $\lambda$ and only learn a single cost function $c$. Indeed, reference [1] does not highlight this modification as a novelty, so it is totally fine for this paper to highlight this trick as a core contribution.

However, this "c'=\lambda c" idea is more like an engineering trick, and more contribution is needed to support this trick. For example, as mentioned in the paper, compared to the tri-level formulation, the bi-level reduction has a simpler optimization landscape and thus is expected to have better optimization result. It will be great if the authors can theoretically support this claim, i.e., theoretically prove that this bi-level formulation is easier to achieve better result than the tri-level formulation.

[1] Liu & Zhu, "Meta inverse constrained reinforcement learning: convergence guarantee and generalization analysis".

Questions

Please see the weaknesses above.

Limitations

The limitation is discussed in the paper and is reasonable.

Author Response

Thank you very much for your thoughtful consideration of our paper. We understand that your primary concern is that our principal claim is trivial. While we agree that the derivation is relatively straightforward, we do not think the reduction is necessarily trivial, as evidenced by the number of peer-reviewed papers which propose complex methods for performing ICRL (Malik et al. 2021, Liu et al. 2023, Kim et al. 2023). Hence, our work would benefit the community by providing a reference and experimental justification for bypassing more complicated ICRL methods in favor of IRL methods for constraint inference.

Regarding your suggestion, “It would be great if the authors could theoretically support this claim, i.e., theoretically prove that this bi-level formulation more easily achieves better results than the tri-level formulation”: we believe that proving this in general is outside the scope of this paper. However, we note that there is already evidence for this in the literature. In particular, GANs are notoriously difficult to train due to the dynamics induced by the adversarial game (Goodfellow et al. 2014). Generally, iterates of gradient descent do not converge to saddle points (Freund and Schapire 1997). Some solutions exist for bi-level optimizations / two-player games (Rakhlin and Sridharan 2012, Moskovitz et al. 2023), but we are unaware of solutions to the tri-level optimization problem.
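As a self-contained illustration of this point about saddle-point dynamics (a minimal sketch on the standard bilinear game $\min_x \max_y xy$, not taken from the paper; the step size and iteration count are arbitrary):

```python
# Plain simultaneous gradient descent-ascent spirals away from the saddle
# point of min_x max_y x*y, while an optimistic update (in the spirit of
# Rakhlin & Sridharan 2012 and Daskalakis et al. 2017) converges to (0, 0).
def run(optimistic: bool, steps: int = 2000, lr: float = 0.1):
    x, y = 1.0, 1.0
    gx_prev, gy_prev = 0.0, 0.0       # previous gradients, used only by the optimistic update
    for _ in range(steps):
        gx, gy = y, x                 # d/dx (x*y) = y,  d/dy (x*y) = x
        if optimistic:
            x -= lr * (2 * gx - gx_prev)
            y += lr * (2 * gy - gy_prev)
        else:
            x -= lr * gx
            y += lr * gy
        gx_prev, gy_prev = gx, gy
    return x, y

print("plain GDA:     ", run(optimistic=False))   # diverges from (0, 0)
print("optimistic GDA:", run(optimistic=True))    # approaches (0, 0)
```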

Generative Adversarial Networks, Goodfellow et al. 2014

Training GANs with Optimism, Daskalakis et al. 2017

A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, Freund and Schapire 1997

Online Learning with Predictable Sequences, Rakhlin and Sridharan 2012

ReLOAD: Reinforcement Learning with Optimistic Ascent-Descent for Last-Iterate Convergence in Constrained MDPs, Moskovitz et al. 2023

G. Liu, Y. Luo, A. Gaurav, K. Rezaee, and P. Poupart. Benchmarking constraint inference in inverse reinforcement learning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.

S. Malik, U. Anwar, A. Aghasi, and A. Ahmed. Inverse constrained reinforcement learning. In International conference on machine learning, pages 7390–7399. PMLR, 2021

Comment

Thanks for the response. I will keep the current rating.

Comment

Dear Reviewer, Thank you for your continued discussion of our work and assistance in improving the research. We believe our paper updates and responses have addressed your concerns. If not, please describe why our response is not sufficient, and we would be happy to make further improvements. We look forward to further discussion.

Review (Rating: 6)

The paper proposes a novel inverse constraint learning approach that leverages inverse reinforcement learning (IRL). Prior work by Kim et al. and Malik et al. proposed a game-theoretic approach to the constraint learning problem, where the resulting optimization problem is a tri-level optimization problem involving the policy, the constraint function, and the Lagrange multiplier. The algorithm is an iterated optimization that alternates between constrained RL and constraint function learning. In contrast, the present work leverages the equivalence between the original problem and a simpler IRL problem, which was overlooked in prior work. This reduction removes the Lagrange multiplier from the decision variables and enables the use of an existing IRL algorithm for the purpose of simultaneous constraint learning and constrained policy optimization. Simulation results in the Mujoco environment suggest the feasibility of the IRL approach.

Strengths

  • The present work provides a key insight that the inverse constraint learning problem of Kim et al. is equivalent to inverse reinforcement learning, which was overlooked in the prior work. Although the equivalence requires an additional assumption that the set of constraint functions must be closed under multiplication by positive scalars, we can apply game-theoretic IRL algorithms to solve the constraint learning problem as long as we conform to the constraint function class.

Weaknesses

Although the present work provides an interesting insight into constraint learning problems in RL, the paper in its current form possesses multiple issues.

Major Issues

The most concerning issue is the inconclusive and inconsistent nature of the analysis of experimental results, as detailed below.

  • The feasible rewards and the violation rate are two competing evaluation metrics employed in the paper, which are equally important considering the natural trade-off between performance and safety. Nevertheless, Figure 1 focuses primarily on the feasible rewards to compare different methods; most notably, it reports information on the violation rate only when it is worse than the proposed vanilla IRL method. Such partial reporting can easily bias the interpretation of the results, since it effectively ignores circumstances where the violation rate is improved for baselines. The authors should fully report both the feasible rewards and the violation rate, and investigate whether any method Pareto-dominates the others in the two evaluation criteria. No definitive conclusion can be drawn without this analysis.

  • No detailed analysis of Figure 2 is presented in Section 5.1. In fact, the authors did not reference Figure 2 in the main body of Section 5, and the caption does not provide any analysis either. By looking at Figure 2, it is not clear whether any conclusion can be drawn about the relative performance between the proposed algorithms and the two baselines (i.e. MECL and GACL).

  • Speaking of the performance of MECL and GACL, it is not clear what “a best estimate” means in the footnote on page 6. In the first place, the authors should either reproduce the prior work with PPO or report the missing results as they are, with preference given to the former; they should never estimate missing values. If reproduction is impossible, I would contact the authors and ask for raw data.

  • Figure 3 is incomplete. Specifically, the left plot on Feasible Reward is missing GACL 80% and GACL 50%. Similarly, the right plot on Violation Rate is missing MECL 50%.

  • The paper examines whether various techniques for stabilizing constraint learning actually help the IRL approach, and provides some analysis in Figure 4 and Section 5.2. Unfortunately, its credibility is highly questionable given the small sample size (with only 3 seeds) and the fact that the error bars are largely overlapping in Figure 4. Rather than reporting standard deviation, the authors should report more appropriate uncertainty information, such as the standard error of the mean or a confidence interval. If time permits, please also conduct the experiment with more seeds, which may help separate the error bars and aid for more comprehensible results. (Note that the standard error reduces as the sample size is increased, as opposed to the standard deviation.)

  • The conclusion is partly self-conflicting. Specifically, in Section 5.2 the authors mention batch normalization (BN) and reward normalization (RN). They state that “we find these modifications are generally more harmful than beneficial. Including batch normalization, reward normalization or both, … tended to hurt performance over basic IRL in a majority of environments.” On the contrary, in Section 6 the authors write “certain combinations of additional simple regularization such as batch normalization and reward normalization can produce significantly better results in several environments.”

Minor Issues

Problem Formulation

  • In equation (6), the Lagrange dual variable $\lambda$ should always be the outer optimization variable. I believe that it should be max min, not min max (unless the minimax theorem holds).

  • Equation (8) and its subsequent analysis are central to the development of this paper, and hence may be worth stating as a proposition or a theorem.

Related Work

  • The authors seem to consider general Imitation Learning (IL) and IRL as two separate problems. A common view is that there are several categories in IL, of which behavioral cloning (BC) and IRL are among the most popular [1][2].

  • There is one reference that is missing on Page 3, line 98.

  • The authors state in Section 1 that “we would like to extract safety constraints from the data based on the expert behavior, which can then be used downstream to constrain task-specific learning.” Besides the RL approaches, there is a thread of prior work in learning-based control that considers this problem through the use of barrier functions and fully decoupling downstream task-specific learning from constraint learning (e.g. [3][4][5]), which is missing from the literature review.

[1] Zare, Maryam, Parham M. Kebria, Abbas Khosravi, and Saeid Nahavandi. "A survey of imitation learning: Algorithms, recent developments, and challenges." arXiv preprint arXiv:2309.02473 (2023).

[2] Osa, Takayuki, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, and Jan Peters. "An algorithmic perspective on imitation learning." Foundations and Trends® in Robotics 7, no. 1-2 (2018): 1-179.

[3] Robey, Alexander, Haimin Hu, Lars Lindemann, Hanwen Zhang, Dimos V. Dimarogonas, Stephen Tu, and Nikolai Matni. "Learning control barrier functions from expert demonstrations." In 2020 59th IEEE Conference on Decision and Control (CDC), pp. 3717-3724. IEEE, 2020.

[4] Lindemann, Lars, Alexander Robey, Lejun Jiang, Satyajeet Das, Stephen Tu, and Nikolai Matni. "Learning robust output control barrier functions from safe expert demonstrations." IEEE Open Journal of Control Systems (2024).

[5] Castaneda, Fernando, Haruki Nishimura, Rowan Thomas McAllister, Koushil Sreenath, and Adrien Gaidon. "In-distribution barrier functions: Self-supervised policy filters that avoid out-of-distribution states." In Learning for Dynamics and Control Conference, pp. 286-299. PMLR, 2023.

Questions

  • In equation (4), do we need the outer optimization in $\lambda$ to be over non-negative real values, not the entire real line (because of the inequality constraint)?

  • In the IRL formulation, the class of constraint functions $\mathcal{F}_c$ is more restrictive than the original formulation of Kim et al. (which is convex and compact). Can the authors comment on possible negative implications of this restriction?

  • I did not fully follow the reward scaling part in Section 4.2. I wonder if the authors can elaborate on what they meant by “the constraint function is learned independently of the reward scale and hence may be more robust to different reward scales.”

Limitations

  • The limitations are properly discussed in Section 6, but the inconclusive nature of the analysis can be possibly alleviated in the revision, as discussed above.
Author Response

Thank you very much for your thoughtful consideration of our paper and the helpful comments and suggestions. We would hope that the improvements we have made and highlighted in the overall response will alleviate some of your concerns, but we would like to address each of your concerns here specifically:

  • “The authors should either reproduce the prior work with PPO or report the missing results as they are”
    • The performance of MECL and GACL reported in Fig 2 and Fig 4 is taken directly from the tables reported in Appendix D of Liu et al. 2023. However, the results of MECL and GACL for Fig 3 were not reported in a table, and hence we estimated them based on the plots in Liu et al. 2023. We agree that this is insufficient and have instead rerun the authors' code to generate all baseline performance estimates. We provide some of these results in Fig 1 and 2 of the rebuttal PDF, which show the IQM of final performance of our proposed modifications versus MECL.
  • “Figure 1 focuses primarily on the feasible rewards to compare different methods; most notably, it reports information on the violation rate only when it is worse”
    • You are correct that Fig 1 primarily focuses on feasible rewards and only partially reports the violation rate; however, we have included both the violation rate and feasible rewards in full in both Fig 2 and Fig 3. We have replaced Fig 2 with the IQM plots in Fig 1 in the attached PDF, which will help make the reporting of the violation rate clearer. Additionally, we would like to point out that feasible reward is a metric that attempts to capture both safety and performance, as it is the reward obtained in a trajectory only up to the first constraint violation (a small sketch of this metric is given after this list). This is the metric proposed for benchmarking ICRL methods in Liu et al. 2023.
  • “No detailed analysis of Figure 2 is presented in Section 5.1”
    • Fig 2 is a reduced version of Fig 4, showing only the best overall and best per-environment modifications versus the baselines. We did this to make comparison easier; however, we recognize that it may seem redundant and confusing. Following the suggestion of another reviewer, we have moved Fig 4 to the Appendix and replaced it with Fig 2 in the PDF (IQM plots of final performance).
  • “Figure 3 is incomplete”
    • Thank you for pointing this out. The issue is that some of the curves overlap, which makes them difficult to see. We will clarify this in the revision.
  • "Unfortunately, its credibility is highly questionable given the small sample size (with only 3 seeds) and the fact that the error bars are largely overlapping"
    • We understand your criticism of the statistical analysis, particularly the small number of seeds and reporting of standard deviation. We hope that we have addressed these concerns in our overall response and by including additional random seeds.
  • “The conclusion is partly self-conflicting” [re the impact of batch norm and reward normalization]
    • Thank you for pointing out the confusing wording here. What we meant to say is that, overall (across all environments), BN and RN are not generally beneficial. However, in specific environments (e.g., Cheetah and Walker) we do see benefits. So they are helpful, but only on an environment-specific basis. We will make this clearer in the revision.
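For concreteness, a minimal sketch of the feasible-reward metric as described above (the per-step trajectory format and the choice to exclude the violating step itself are illustrative assumptions):

```python
# Feasible reward: the return accumulated along a trajectory up to (but not
# including) the first step on which the constraint is violated.
def feasible_reward(rewards, violations):
    """rewards: per-step rewards; violations: per-step booleans (True = constraint violated)."""
    total = 0.0
    for r, violated in zip(rewards, violations):
        if violated:
            break
        total += r
    return total

print(feasible_reward([1.0, 1.0, 1.0, 1.0], [False, False, True, False]))  # -> 2.0
```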
Comment

Dear authors,

I sincerely appreciate your time and effort in preparing the rebuttal. I also appreciate that the authors have re-run the evaluation with more seeds and performed a more extensive baseline comparison, as well as the statistical post-processing.

Although these new results are more interpretable and encouraging, the authors seem to have only responded to my "major issues" in this individual thread. I would highly appreciate your individual responses to my "minor issues" as well as "questions" so I can make a more informed recommendation.

Best, Reviewer xRd9

Comment

Dear Reviewer, Thank you for your continued discussion of our work and assistance in improving the research. We are glad that you found our additional results more interpretable. Thank you also for pointing out that our response to your review was incomplete. While we had addressed your questions and minor issues, we had erroneously neglected to copy these into our final response here. We sincerely apologize for this oversight and address these issues below:

  • “In the IRL formulation, the class of constraint functions $\mathcal{F}_c$ is more restrictive than the original formulation of Kim et al. (which is convex and compact). Can the authors comment on possible negative implications of this restriction?”
    • Thank you for this question. In fact, the class of constraint functions $\mathcal{F}_c$ that we consider is less restrictive than that of Kim et al., because we do not assume compactness (a convex cone violates the compactness condition). This condition is required to prove the regret bounds for inverse constrained RL given by Kim et al. 2023. However, we note that in large-scale applications that approximately optimize a constraint model represented by a deep neural network, one is already in a setting in which these regret bounds do not hold. We make this point more clearly in the revision.
  • “I did not fully follow the reward scaling part in Section 4.2. I wonder if the authors can elaborate on what they meant”
    • The reasoning for using reward scaling in this case is to constrain the rewards from a possibly unbounded range to a smaller range of values, in order to facilitate learning the constraint function (a minimal illustrative sketch appears after this list). In particular, we are seeking a simple approach to constraint inference that does not require significant environment-specific hyperparameter tuning, as ICRL approaches do. We hypothesized that normalizing rewards could lead to more consistent optimization across environments with different reward scales.
  • "Equation (8) and its subsequent analysis are central to the development of this paper, and hence may be worth stated as a proposition or a theorem."
    • Thank you for this suggestion. We hope that Theorem 1 and its proof from our general response has addressed this concern. We have included Theorem 1 in the main paper of our revision and added the proof to the Appendix.
  • “In equation (6), the Lagrange dual variable 𝜆 should always be the outer optimization variable”
    • We believe the order of optimization is correct in Equation (6). Please see Kim et al. 2023, Eq. 3, or refer to Theorem 3.6 in Altman 1999.
  • “The authors seem to consider general Imitation Learning (IL) and IRL as two separate problems”.
    • Indeed we do consider the two to be separate problems, although they are of course highly related, as you point out. In our opinion, the two are primarily distinguished by the recovery of a reward function or, in this case, constraint function. This is important when we consider the motivation for ICRL which we outline in the introduction, i.e. to learn a constraint function that could be transferred to new tasks to facilitate safer learning. For this reason we consider the distinction between IRL and IL to be important in this case.
  • “In equation (4), do we need the outer optimization in 𝜆 to be over non-negative real values, not the entire real line (because of the inequality constraint)?”
    • Yes, thank you for pointing this out. We will correct it in the revision.
  • "Besides the RL approaches, there is a thread of prior work in learning-based control that considers this problem through the use of barrier functions"
    • Thank you very much for pointing us towards this prior work. We agree that this would be relevant to include as related work and will do so in the final revision.
  • "There is one reference that is missing on Page 3, line 98."
    • Thank you for pointing this out, we will correct it in the final revision.
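As referenced above, a minimal sketch of one common form of reward normalization (running mean and standard deviation via Welford's method; an illustrative assumption, not necessarily the exact scheme used in the paper):

```python
# Normalize rewards to a consistent scale before they are combined with the
# learned constraint, e.g. r_tilde = normalizer.normalize(r) - c(s, a).
class RunningRewardNormalizer:
    def __init__(self, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations (Welford)
        self.eps = eps

    def update(self, r: float) -> None:
        self.count += 1
        delta = r - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (r - self.mean)

    def normalize(self, r: float) -> float:
        std = (self.m2 / max(self.count, 1)) ** 0.5
        return (r - self.mean) / (std + self.eps)

normalizer = RunningRewardNormalizer()
for r in [1.0, 250.0, -3.0, 40.0]:     # rewards on very different scales
    normalizer.update(r)
print(normalizer.normalize(40.0))
```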

K. Kim, G. Swamy, Z. Liu, D. Zhao, S. Choudhury, and S. Z. Wu. Learning shared safety constraints from multi-task demonstrations. Advances in Neural Information Processing Systems, 36, 2023.

Altman, E. 1999. Constrained Markov decision processes.

We believe these comments have addressed your remaining concerns with the work. If there are any outstanding issues we would be happy to continue discussion.

Comment

Thank you for the further clarification. Most of the concerns and questions have been resolved thanks to your detailed responses. Thus, I will raise my score. However, I still respectfully disagree with the authors on the taxonomy of IL and IRL. Following the prior literature listed in my initial review, IRL is a particular approach to formulating the broader problem of Imitation Learning (IL). Behavior Cloning (BC) is an alternative approach to IL, but IRL distinguishes itself from BC by learning a reward function in addition to the expert behavior $\pi(a \mid s)$.

Comment

Dear reviewer, we agree with your characterization of IRL as a class of IL methods that recovers the reward function. Thank you again for your valuable suggestions, which have greatly helped us improve the paper. We are glad our additional results and responses have addressed your concerns, and thank you for increasing your score. We will do our best to further improve the paper.

Review (Rating: 7)

The paper explains how inverse constrained RL (ICRL) - the task of recovering (safety) constraints from expert behaviour respecting those constraints - is equivalent to (straight) inverse RL (IRL) - the task of recovering a reward function from expert behaviour approximately optimizing that reward. This allows the ICRL problem to be solved using the wide array of techniques available in IRL. The authors then go on to empirically show superior performance of using a basic IRL algorithm on the IRL formulation relative to specialized ICRL algorithms.

Strengths

  • The paper raises an important point: ICRL has emerged as a subarea of ML in the past few years with its own methods. Showing that it is equivalent to an existing, more developed area of research (IRL) is certainly useful, since it (1) unlocks the use of IRL methods in ICRL and (2) puts into question the need for ICRL to exist as a separate subarea, placing more burden on ICRL researchers to show that their methods indeed solve the problem better than vanilla IRL, and why. The results from this paper suggest that currently that is not the case.
  • The paper is clear and easy to understand.
  • The paper is a prime illustration of the maxim that a machine learning conference paper should ideally make a single crisp point.
  • Experiments support the claims made in the paper.

Weaknesses

  • In some places, the paper is a bit too wordy and the same thing could be said as clearly with a shorter sentence (an LLM will surely be happy to provide suggestions, of which at least some may be good).
  • The statistical methodology is somewhat underwhelming in Section 5, but not a deal-breaker for me. Fig 1 could give confidence intervals on the mean. Similarly, it'd be useful to have a confidence interval on the mean in the other two figures, and ideally across more seeds than 3 (since supposedly the hypothesis we're testing is whether the mean is better than the baselines). The results in 5.2 in particular are somewhat messy. Maybe estimating something like Shapley values would be helpful here? Also, normalizing the results and then aggregating across the environments could help paint a clearer picture (you could work with a normalized effect size relative to the IRL baseline as the main quantity). I mostly find the final performance interesting, so I wouldn't be against moving the training curves into the appendix if you want to free up space.

Questions

Suggestions for improvement:

  • There is often missing punctuation after equations (e.g. full stop at the end of line 150 or commas after equations 1 and 6). In general, equations are part of the sentence structure and should be punctuated as such (and don't necessarily need to be preceded by a colon).
  • Weakly held opinion: equations 5 and 6 would feel more natural to me in negated form, i.e., finding a policy that maximizes the return subject to a reward that minimizes the return of the policy relative to the expert (yes, I'm also taking into consideration that this then leads to equations 7 and 8). As they are, I find them a bit confusing at first without explanation.
  • The colour-coding of Figure 1 seems a bit unfortunate and doesn't help me much in reading the figure: could, e.g., positive values be shades of green and negative ones shades of red, with white at zero (not fussy about particular colors, but something that creates more contrast)? Please add x-axis labels to Figures 2, 3, and 4. Not all of the horizontal lines in Fig 3 are visible (maybe add a slight offset to prevent perfect overlap?). I'd also prefer seeing a confidence interval on the mean across more than 3 seeds, rather than the standard deviation.

Minor typos:

  • l.3 "However" -> "however"
  • l. 98 has a missing reference
  • l. 126 has double space

Limitations

I think the paper has a clear scope corresponding to the scope of previous work and doesn't introduce significant limitations beyond that.

Author Response

Thank you very much for your thoughtful consideration of our work. We are very glad that you found our work valuable and greatly appreciate your suggestions for improvement.

Regarding your main concern, “the statistical methodology is somewhat underwhelming in Section 5”, we have incorporated many of your suggestions in the overall response: (1) we report only final performance in Fig 2 in the rebuttal PDF, (2) we compute confidence intervals rather than standard deviations, (3) we include 5 seeds rather than 3, and (4) we have included Fig 1 in the rebuttal PDF, which aggregates across all environments using an expert-normalized score. Regarding your minor suggestions:

  • “Equations 5 and 6 would feel more natural to me in negated form”: Another reviewer was also confused by this; we will include the equations in negated form in the final paper.
  • “The colour-coding of Figure 1 seems a bit unfortunate and doesn't help me much in reading the figure”: We will replace Figure 1 with Figure 1 from the rebuttal PDF in the final version.
Comment

Dear Reviewer, We hope that you've had a chance to read our responses and clarification. As the end of the discussion period is approaching, we would greatly appreciate it if you could confirm that our updates have addressed your concerns.

Comment

Thank you for your response. I appreciate all of the changes made, especially the inclusion of the confidence intervals. The figures now give a better sense of what's going on.

One last comment: in your main rebuttal and in the PDF, you claim that the IRL methods Pareto-dominate. I think that claim would make me expect that (1) each of the methods considered performs at least as well as the baseline on (2) each of the tasks considered. I would probably refrain from using the term when the claim is that the mean (across both tasks and IRL variants) is higher.

And a minor point: if you wanted to aim for plotting perfection, you could harmonize colour-coding between Figs 1 and 2, which would make comparison easier.

That said, I'm keeping the favourable score of "Accept".

Comment

Dear reviewer, thank you very much for your feedback and favorable review. Your suggestions have been very helpful in improving our paper. We will incorporate both of these suggestions in our final revision as we continue to improve our paper.

Review (Rating: 6)

The authors propose a method for learning the constraints from demonstrations. To achieve this goal they take note of the similarities between inverse reinforcement learning (IRL) and inverse constraint reinforcement learning (ICRL). The authors aim to reduce the tri-level optimization of constrained inverse reinforcement learning to a bi-level optimization using a special class of constraints.

Strengths

  • The main claim of the paper with regard to reducing the ICRL to IRL under certain classes of constraint functions is original and can be quite impactful.

Weaknesses

  • The authors' claim that the tri-level optimization can be reduced to a bi-level one has a mathematical motivation but is not supported by a mathematical derivation. The authors aim to support the claim via experimental results, but the results have high variance and are thus not fully convincing.
  • The paper's main claims are tested using methods developed by prior work (MECL, GACL). The specifics of the modifications done by the authors are somewhat unclear, however. The authors mention using SAC instead of PPO for the IRL implementation of the prior method they compare with. Later it is mentioned that their results are directly taken from the prior works due to lower performance of their SAC-based implementation.
  • The paper combines various modifications to prior work (batch and reward normalization, separate critics, policy reset) in order to examine their benefits and support the main claims. However, due to the large variance in results on a limited set of locomotion environments, it is difficult to accept the results with a high degree of certainty.

Questions

  • The clarity of the paper can be significantly improved by adding diagrams or pseudocode describing their specific modifications to the previous IRL methods.
  • Question mark instead of a citation on line 98 (Related Works -> Inverse Reinforcement learning).
  • Eq. (1) describes the IRL objective. In its current form, it seeks a reward function that minimizes the difference between the return of the expert and the current policy while the policy maximizes this difference. The objective should be the opposite: the policy should aim to reduce the difference between the return of the expert and the current policy, and the reward should aim to maximize this difference.
  • Given that this paper focuses on learning policies that violate fewer constraints, and given that the claims are supported mainly by experimental results, the experiments section should focus on a more diverse set of environments that require safety constraints, or at least describe in more detail the constraints present in the MuJoCo environments used for the experiments.
  • If the tri-level optimization can be reduced to a bi-level IRL problem, can it then be further reduced to a single level optimization using methods such as Inverse Q-Learning [1] or IQ-Learn [2]? How does the constraint violation rate of the proposed method compare to such algorithms?

[1] Kalweit, G., Huegle, M., Werling, M., & Boedecker, J. (2020). Deep Inverse Q-learning with Constraints. Advances in Neural Information Processing Systems, 33, 14291-14302.

[2] Garg, D., Chakraborty, S., Cundy, C., Song, J., & Ermon, S. (2021). IQ-Learn: Inverse soft-q learning for imitation. Advances in Neural Information Processing Systems, 34, 4028-4039.

Limitations

  • High variance in the experimental results (acknowledged by the authors).
  • The number of constrained environments and the discussion of how these constraints are present in each environment are lacking.
Author Response

Thank you very much for your thoughtful consideration of our work. We would hope that the improvements we have made and highlighted in the overall response will alleviate some of your concerns, but we would like to address each of your concerns here specifically:

  • “The authors' claim that the tri-level optimization can be reduced to a bi-level one has a mathematical motivation but is not supported by a mathematical derivation.”
    • We hope the provided proof in the general response alleviates your concerns here.
  • “The specifics of the modifications done by the authors are somewhat unclear”, “The clarity of the paper can be significantly improved by adding diagrams or pseudocode describing their specific modifications to the previous IRL methods.”
    • We have added pseudocode for the separate-critics IRL method for inferring constraints in the attached PDF. The additional modifications (policy reset, batch norm, and reward normalization) are implemented as standard.
  • “Later it is mentioned that their results are directly taken from the prior works due to lower performance of their SAC-based implementation.”
    • Indeed, we used SAC in our implementation because the ICRL method we compare against is motivated by maximum-entropy RL. In their implementation, they use an entropy bonus with a PPO objective. Though we tried various learning rates for the ICRL method in SAC (see Appendix 1), we needed to tune all hyperparameters from scratch due to the change in algorithm. In the code of Liu et al 2023, it appears that hyperparameters were tuned quite specifically to each environment, and this full hyperparameter tuning was beyond the scope of our work. Hence, we compare directly to the reported results from Liu et al 2023, without reimplementing in SAC.
  • “However, due to the large variance in results on a limited set of locomotion environments, it is difficult to accept the results with a high degree of certainty.”
    • We hope that our general response has alleviated the concerns regarding the variance in the results. Regarding the diversity of environments, we use these environments because of their precedent in prior work on ICRL (Liu et al. 2023, Malik et al. 2021).
  • “Eq. (1) The objective should be the opposite”
    • We believe the equation is correct, however, another reviewer pointed out that it would be more intuitive to report the negation of these equations. We will do this in the final version.
  • "If the tri-level optimization can be reduced to a bi-level IRL problem, can it then be further reduced to a single level optimization using methods such as Inverse Q-Learning [1] or IQ-Learn [2]? How does the constraint violation rate of the proposed method compare to such algorithms? "
    • Thank you for raising this question. It should potentially be possible to do this for IL purposes; however, in ICRL the goal is generally to recover the constraint function. It is not entirely straightforward how to disambiguate rewards and constraints in a method like IQ-Learn, which learns the Q function directly. This may be an interesting direction for future work.

G. Liu, Y. Luo, A. Gaurav, K. Rezaee, and P. Poupart. Benchmarking constraint inference in inverse reinforcement learning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.

S. Malik, U. Anwar, A. Aghasi, and A. Ahmed. Inverse constrained reinforcement learning. In International conference on machine learning, pages 7390–7399. PMLR, 2021

Comment

Dear Reviewer, We hope that you've had a chance to read our responses and clarification. As the end of the discussion period is approaching, we would greatly appreciate it if you could confirm that our updates have addressed your concerns.

Comment

I thank the authors for their responses and the additional clarification. My concerns have been addressed, and I raised my scores accordingly.

Author Response

We thank the reviewers for their thoughtful comments and suggestions. As the reviewers note, the main claim of our paper is that treating ICRL as a separate problem from IRL may not confer a particular benefit and that, in fact, it should be beneficial not to segment the problem class, because it means we can make use of the wide IRL literature to solve ICRL problems. We are glad that most reviewers agree this is an interesting idea. We also provide empirical analysis to validate this claim and provide details on how to best use IRL for ICRL. We understand that the primary concern of several reviewers is the statistical robustness of these results, including the high variance, the low number of seeds, and the reporting of statistical significance. We have taken the following steps to alleviate these concerns:

  • We have rerun all experiments across 5 seeds to improve the robustness of our results. This is consistent with the number of seeds used by Liu et al 2023.
  • We believe that the way we are currently reporting results may overstate the variability of the results. In Figs 2, 3, and 4 we show error bars as the standard deviation across both seeds and the time-smoothing window. Hence, the error bars contain both aleatoric uncertainty (variation across episodes for a single agent) and epistemic uncertainty (variation across seeds). Following several reviewers' suggestions, we propose replacing Figs 1 and 4 with Figs 1 and 2 in the attached PDF, respectively. This makes the following adjustments:
    • In both figures, we use the following procedure. We first compute the mean of the last 50 testing episodes. We then compute bootstrapped confidence intervals of the IQM (interquartile mean) across the 5 seeds using the methodology proposed in Agarwal et al. 2021 (a minimal sketch of this computation is given after this list).
    • In Fig 1 in the attached PDF, we have additionally normalized by expert performance and reported an overall performance curve for all environments, as recommended in Agarwal et al. 2021. This further increases the sample size to reduce variance. Since the baseline MECL is trained with PPO, and so has a different training time, we take only the IQM of final performance and report this as a dashed line (without CI) for comparison. We believe this very clearly illustrates that IRL methods Pareto-dominate the baseline ICRL method in these environments.
    • In Fig 2 in attached, we now report only final performance on a per-environment basis, which we hope facilitates readability when comparing across modifications and environments.
  • We have also rerun the authors' provided code for the baseline method MECL from Liu et al. 2023 and included the same IQM with confidence intervals for this method, computed over 5 seeds, in Fig 2 attached. This allows us to also compare the variance of our method to that of the baseline MECL. We note that the variance in the results of MECL is also very high, as can be seen in Fig 2 in the attached PDF. In general, high variance can be an issue for adversarial IRL; however, we do not think this issue is unique to our method. We also note that, upon re-running the authors' code, we find that the reported performance of ICRL is not reproducible with the code provided when using the final performance (the authors may have reported best performance). This further improves the relative performance of IRL over the ICRL baseline.
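As referenced in the list above, a minimal sketch of a bootstrapped confidence interval for the IQM in the spirit of Agarwal et al. 2021 (the per-seed scores and array shapes are placeholders, not the paper's data):

```python
import numpy as np

def iqm(x):
    """Interquartile mean: mean of the values between the 25th and 75th percentiles."""
    x = np.sort(np.asarray(x).ravel())
    n = len(x)
    return float(x[int(np.floor(0.25 * n)) : int(np.ceil(0.75 * n))].mean())

def bootstrap_iqm_ci(per_seed_scores, n_boot=2000, alpha=0.05, seed=0):
    """per_seed_scores: one score per seed, e.g. the mean expert-normalized
    return over that seed's last 50 test episodes."""
    rng = np.random.default_rng(seed)
    per_seed_scores = np.asarray(per_seed_scores)
    stats = [iqm(rng.choice(per_seed_scores, size=len(per_seed_scores), replace=True))
             for _ in range(n_boot)]
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return iqm(per_seed_scores), (lo, hi)

# Example with made-up scores for 5 seeds:
point, (lo, hi) = bootstrap_iqm_ci([0.81, 0.74, 0.89, 0.78, 0.85])
print(f"IQM = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```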

Finally, we understand that several reviewers would like a more rigorous proof of the equivalence between ICRL and IRL. We offer the following, which has been added to the Appendix of the paper:

Theorem. Let $\Pi$ denote a class of policies and let $\pi_E \in \Pi$ and $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ be a fixed (expert) policy and reward function, respectively. For any class of constraint functions $\mathcal{F} \subset \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$, we define the following objectives:

$$\mathsf{OPT}_{\mathrm{icrl}}(\mathcal{F}) = \max_{c \in \mathcal{F}} \max_{\lambda \geq 0} \min_{\pi \in \Pi} \left[ J(\pi_E, r - \lambda c) - J(\pi, r - \lambda c) \right]$$

$$\mathsf{OPT}_{\mathrm{simple}}(\mathcal{F}) = \max_{c \in \mathcal{F}} \min_{\pi \in \Pi} \left[ J(\pi_E, r - c) - J(\pi, r - c) \right]$$

where $J(\pi, f)$ denotes the expected return (averaged over initial states) earned by the policy $\pi$ for the reward function $f$. Then, if $\mathcal{F}$ is a convex cone, it holds that $\mathsf{OPT}_{\mathrm{icrl}}(\mathcal{F}) = \mathsf{OPT}_{\mathrm{simple}}(\mathcal{F})$.

Proof. Suppose $\mathcal{F}$ is a convex cone. We will first show that $\mathsf{OPT}_{\mathrm{simple}}(\mathcal{F}) \geq \mathsf{OPT}_{\mathrm{icrl}}(\mathcal{F})$. Let $\mu^\pi$ denote the (discounted) occupancy measure of policy $\pi$. It is well known that $J(\pi, f) = \mu^\pi f$. Then, we have

$$\mathsf{OPT}_{\mathrm{simple}}(\mathcal{F}) = \max_{c \in \mathcal{F}} \min_{\pi \in \Pi} (\mu^{\pi_E} - \mu^\pi)(r - c)$$

$$\geq \max_{c \in \mathcal{F}} \min_{\pi \in \Pi} (\mu^{\pi_E} - \mu^\pi)(r - \lambda c) \quad \forall \lambda > 0$$

$$\geq \max_{c \in \mathcal{F}} \max_{\lambda \geq 0} \min_{\pi \in \Pi} (\mu^{\pi_E} - \mu^\pi)(r - \lambda c)$$

$$= \mathsf{OPT}_{\mathrm{icrl}}(\mathcal{F}),$$

where the first inequality holds since $c \in \mathcal{F} \implies \lambda c \in \mathcal{F}$ by the hypothesis that $\mathcal{F}$ is a convex cone. It remains to show that $\mathsf{OPT}_{\mathrm{simple}}(\mathcal{F}) \leq \mathsf{OPT}_{\mathrm{icrl}}(\mathcal{F})$. This is simply shown by

$$\mathsf{OPT}_{\mathrm{simple}}(\mathcal{F}) = \max_{c \in \mathcal{F}} \min_{\pi \in \Pi} (\mu^{\pi_E} - \mu^\pi)(r - 1 \cdot c)$$

$$\leq \max_{c \in \mathcal{F}} \max_{\lambda \geq 0} \min_{\pi \in \Pi} (\mu^{\pi_E} - \mu^\pi)(r - \lambda c)$$

$$= \mathsf{OPT}_{\mathrm{icrl}}(\mathcal{F}). \qquad \blacksquare$$

We will address each reviewer’s particular concerns in our individual responses. Thank you again for your thoughtful consideration of our paper.

Agarwal, Rishabh, et al. "Deep reinforcement learning at the edge of the statistical precipice." Advances in neural information processing systems 34 (2021): 29304-29320.

Liu, Guiliang, et al. Benchmarking Constraint Inference in Inverse Reinforcement Learning. ICLR 2023.

Final Decision

The paper's main contribution is an observation that in order to infer RL constraints from data, instead of solving a tri-level optimization problem one can solve a simpler bi-level one, under reasonable assumptions on the constraint functions.

Originally the reviewers had several concerns, mainly focused on the paper's lack of theoretical justification and the statistical significance of the experimental results. However, the authors successfully addressed most of them in the rebuttals, including by adding a formal result about the algorithm's correctness. While there are still presentation issues to be addressed and, overall, some reviewers find the paper's contribution straightforward, the metareviewer thinks this contribution is valuable nonetheless, trusts that the authors will improve the manuscript per the reviewers' suggestions, and recommends this work for acceptance.