PaperHub
7.5/10
Spotlight · 4 reviewers
Ratings: 8, 6, 8, 8 (min 6, max 8, std dev 0.9)
Confidence: 2.5
ICLR 2024

Provable Offline Preference-Based Reinforcement Learning

OpenReview · PDF
Submitted: 2023-09-22 · Updated: 2024-03-18
TL;DR

PAC offline reinforcement learning from preference feedback over trajectories using general function approximation.

Abstract

Keywords
reinforcement learning theory, offline reinforcement learning

Reviews and Discussion

Review
8

The authors propose a PbRL algorithm that computes a return utility function based on MLE. They then construct a confidence set based on that estimate and solve a planning problem to derive an approximately optimal policy. In contrast to most PbRL work, they do not assume the existence of a state-action utility. This algorithm forms the basis for analyzing its sample complexity, using a novel trajectory-based concentrability coefficient. The analysis is also applied to a setting with learned transition dynamics and to action preferences.

Strengths

Most work in PbRL focuses on empirical evaluation, so additional work on the theoretical foundations is important. This work is especially relevant, as it is not restricted to a specific class of function approximation and considers return-based utility. Reward-based utility is usually easier to handle but can induce abnormalities in the preference setting, so the generalization to returns is highly interesting. Adapting the concentrability coefficient to this setting is also interesting, as it allows separating the (offline) optimization of the policy from the issues of exploration/exploitation.

Weaknesses

The most substantial weakness of the paper is that the related work focuses on other theoretical work, but not on practical work. For example, the idea of return-based utility approximation is not novel (e.g., Preference-Based Policy Learning, Akrour, 2011; see Wirth 2017 for more examples). This matters for two reasons: first, it allows the reader to determine which algorithms fit under the proposed framework, and second, it clarifies the differences between the proposed algorithm and other works. Considering that the algorithm is presented as one of the main contributions, while likelihood-based estimation of the return utility is already known, the novelty of the algorithm is somewhat limited.

Besides that, there are several possibilities to improve the presentation:

  • The reward is defined with respect to trajectories, but this is commonly called the return in the PbRL literature.
  • Some relevant information is left to the appendix - mostly the feasible implementation and the comparison to Zhu (2023). Especially the details concerning the feasible implementation are relevant for practitioners.
  • Remark 3 is not explained.

Furthermore, one of the statements is not obvious to the reviewer:

  • "cannot be relaxed to the per-step concentrability without additional assumptions, such as on the reward structure" - Why does it follow from the theorems that state-action rewards do not enable this?

Minor remarks:

  • Missing reference: "..per-step concentrability coefficient Cst commonly used in the general offline RL literature"
  • Missing reference: "..It is well-known that when the reward is state-action-wise, the optimal policy π is both Markovian and deterministic."
  • Definition 1 is a bit hard to understand, because g is never defined.
  • The slack parameter in Sec. 4.1 has the wrong symbol
  • Typo in Alg 2: "Distributionally robust plnanning"

Questions

Please explain "cannot be relaxed to the per-step concentrability without additional assumptions, such as on the reward structure" (see above)

Comment

Thank you for the valuable feedback! Here are our responses:

  • Comparison against Practical Work

We mainly discuss theoretical works in the related work because our paper also focuses on the theory of offline RLHF. That said, our algorithm is still quite different from the practical algorithms presented in [1]. Although they also use MLE to estimate the reward model, they typically run RL algorithms with respect to the learned MLE reward (i.e., $\hat{r}$ in Algorithm 1) without any pessimism. The lower bound in [Theorem 3.9, 2] shows that this kind of algorithm can fail in the worst case.

In contrast, we use pessimism to circumvent such failures. More specifically, we construct a confidence set of reward models and then evaluate each policy with the most pessimistic reward model inside the confidence set. This is the key difference between our algorithm and the existing commonly used algorithms in [1], and this modification is crucial to overcoming their limitations.
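To make the two-step structure concrete, here is a minimal, self-contained sketch of the MLE-then-pessimism idea (purely illustrative, not the algorithm from the paper): it assumes a linear reward over hand-crafted trajectory features, a sigmoid link, and an $\ell_2$-ball confidence set around the MLE, all of which are simplifications introduced only for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mle_reward(feat_diffs, prefs, lr=0.1, iters=2000):
    """Fit theta so that P(tau^1 preferred over tau^0) = sigmoid(<theta, phi(tau^1) - phi(tau^0)>)."""
    theta = np.zeros(feat_diffs.shape[1])
    for _ in range(iters):
        p = sigmoid(feat_diffs @ theta)                        # predicted preference probabilities
        theta += lr * feat_diffs.T @ (prefs - p) / len(prefs)  # gradient ascent on the log-likelihood
    return theta

def pessimistic_value(policy_feat, theta_hat, radius):
    """Worst-case value of <theta, policy_feat> over the l2 ball ||theta - theta_hat|| <= radius."""
    return policy_feat @ theta_hat - radius * np.linalg.norm(policy_feat)

# Toy usage: fit the reward from synthetic preference pairs, then rank two candidate policies
# by their pessimistic values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # phi(tau^1) - phi(tau^0) for each labeled pair
true_theta = np.array([1.0, -0.5, 0.3, 0.0, 0.2])
y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)
theta_hat = mle_reward(X, y)
candidates = {"pi_a": rng.normal(size=5), "pi_b": rng.normal(size=5)}
best = max(candidates, key=lambda k: pessimistic_value(candidates[k], theta_hat, radius=0.5))
print("pessimistically preferred policy:", best)
```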

  • Reward vs Return

Thanks for pointing it out! We will revise this in the final version.

  • Appendix

Thanks for pointing it out! We will move the key points in Appendix to the main paper in the final version.

  • Remark 3

We will clarify it as follows. In line 4 of Algorithm 1, we need to compute $\mathbb{E}_{\tau \sim \mu_{ref}}[r(\tau)]$, which might be computationally inefficient since the space of all trajectories is large. In this case, we can sample $N_0$ trajectories $(\tau^i)_{i=1}^{N_0}$ from $\mu_{ref}$ and simply use $\frac{1}{N_0}\sum_{i=1}^{N_0} r(\tau^i)$ as a surrogate for $\mathbb{E}_{\tau \sim \mu_{ref}}[r(\tau)]$. By the Azuma-Hoeffding inequality, this only incurs an error scaling with $\frac{1}{N_0}$. Therefore, as long as we choose $N_0 = O(1/\epsilon^2)$, this error can be neglected while line 4 becomes more computationally efficient.
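A minimal sketch of this sampling-based surrogate is below; `sample_trajectory` and `reward` are hypothetical placeholders standing in for draws from $\mu_{ref}$ and the candidate reward model, not the paper's API.

```python
import numpy as np

def mc_reference_return(sample_trajectory, reward, n0=10_000, seed=0):
    """Approximate E_{tau ~ mu_ref}[r(tau)] by averaging the reward over n0 sampled trajectories."""
    rng = np.random.default_rng(seed)
    # Concentration of the empirical mean makes the approximation error shrink as n0 grows.
    return np.mean([reward(sample_trajectory(rng)) for _ in range(n0)])

# Toy usage with stand-in sampler and reward (both placeholders).
estimate = mc_reference_return(
    sample_trajectory=lambda rng: rng.normal(size=4),        # pretend "trajectory"
    reward=lambda tau: float(np.clip(tau.sum(), 0.0, 1.0)),  # bounded trajectory reward
)
print("surrogate for E_{tau ~ mu_ref}[r(tau)]:", estimate)
```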

  • Cannot Be Relaxed to Per-Step Concentrability

Theorem 2 shows that there exists a set of MDPs whose rewards are defined per step (i.e., $r(\tau) = \sum_{h=1}^H r_h(s_h, a_h)$) and an offline dataset satisfying per-step concentrability such that no offline algorithm is able to learn a near-optimal policy. This implies that with only per-step concentrability, we are not able to find a high-quality policy, even if we assume the rewards of the MDPs are defined per step. In contrast, with our trajectory-wise concentrability coefficient in Definition 2, we are able to learn a near-optimal policy for any MDP, no matter whether the reward is defined over trajectories or per step. We will add this explanation in the final version.

  • Missing Reference

We are sorry about the missing references. We list them here and will add them to the main paper: per-step concentrability coefficient: [3]-[7]; optimal policy: [8].

  • Definition of $g(\cdot\mid\tau_1, \tau_2)$

We are sorry about the unclearness of the definition of $g$. $g$ is a function mapping from $\mathcal{T} \times \mathcal{T}$ (recall that $\mathcal{T} = (\mathcal{S} \times \mathcal{A})^H$ is the set of all possible trajectories) to $\mathbb{R}^2$, and thus $g(\cdot\mid\tau_1, \tau_2)$ is a two-dimensional vector. It does not need to be a distribution, i.e., $g(\cdot\mid\tau_1, \tau_2)$ is not necessarily positive and $\Vert g(\cdot\mid\tau_1, \tau_2)\Vert_1$ does not need to be 1. Similar functions are elaborated in the definitions in Appendix E. We will revise this in the final version.

  • Wrong Symbol and Typo

Thanks for pointing them out! We will correct them in the final version.

[1] Wirth, C., Akrour, R., Neumann, G., Fürnkranz, J., et al. (2017). A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1–46.

[2] Zhu, B., Jiao, J., and Jordan, M. I. (2023). Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. arXiv:2301.11270.

[3] Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. (2021). Bellman-consistent pessimism for offline reinforcement learning. arXiv:2106.06926

[4] Uehara, M. and Sun, W. (2021). Pessimistic model-based offline RL: PAC bounds and posterior sampling under partial coverage. arXiv:2107.06226.

[5] Zhan, W., Huang, B., Huang, A., Jiang, N., and Lee, J. (2022a). Offline reinforcement learning with realizability and single-policy concentrability. In Conference on Learning Theory, pages 2730–2775. PMLR.

[6] Shi, L., Li, G., Wei, Y., Chen, Y., and Chi, Y. (2022). Pessimistic Q-learning for offline reinforcement learning: Towards optimal sample complexity. arXiv:2202.13890.

[7] Li, G., Shi, L., Chen, Y., Chi, Y., and Wei, Y. (2022b). Settling the sample complexity of model-based offline reinforcement learning. arXiv:2204.05275.

[8] Bertsekas, D. P. (2017). Dynamic programming and optimal control. Athena Scientific.

Comment

Thanks for the clarifications. I slightly increased my score, mostly due to the "Comparison against Practical Work". I expect this to also be included in the camera-ready (CR) version, at least in a condensed form.

Comment

Thank you so much for increasing the score! We will include that part in the final version.

Review
6

The paper investigates preference-based offline reinforcement learning. The authors consider a setting where the dataset consists of trajectory pairs with a preference label. They assume a preference model over trajectories and propose an algorithm with two parts: the first obtains a confidence set of reward functions using maximum likelihood estimation, and the second obtains the policy over this confidence set. Their algorithm comes with a guarantee that the target policy can be learned with a polynomial number of samples.

Strengths

(1) The problem of reinforcement learning with human feedback, specifically offline preference learning, is highly relevant.

(2) Their algorithm provides a guarantee of being able to learn the target policy using a polynomial number of samples.

(3) They extend their algorithm to settings with unknown transition kernel and state-action preferences.

(4) They introduce a new single-policy concentrability coefficient to measure the coverage of the target policy on the dataset.

Weaknesses

(1) There are no implementation details about the algorithms. These algorithms are theoretically sound but there is no intuition for the reader on how to implement them or construct a practical algorithm.

(2) There are no experiments either. So there is no way to check how this method empirically scales.

(3) Several terms in the paper are not well explained. For example, $\epsilon$-bracketing: what do you mean by $g(\cdot\mid\tau_1, \tau_2)$? Is $g$ a probability measure, or a preference?

(4) Proposition 1: For $\epsilon$ arbitrarily small and the bounds $B$ and $R$ arbitrarily large, $\log N_{\mathcal{G}_r}$ can be arbitrarily large, making the sample bounds weak.

(5) The sample bound, i.e., the sample complexity $N = O(\ldots\log(\ldots 1/N))$: how can $N$ appear on both sides? Something seems missing.

Questions

(1) In Section 4.1, it is mentioned that the distance between $r$ and $r^*$ is computed as the total variation distance or $\ell_1$ norm. Aren't $r$ and $r^*$ scalars?

(2) Does C_tr (per trajectory concentration coefficient) not depend on \mu_1? If so, why?

(3) How does $C_r(\mathcal{G}_r, \pi_{tar}, \mu_1)$ reduce to $\sqrt{C_{tr}}$?

(4) It looks like pi, mu and d are used interchangeably. I recommend correcting this.

Comment

  • From $C_r(\mathcal{G}_r, \pi_{tar}, \mu_{ref})$ to $C_{tr}$ in Questions

Note that in this reduction we take $\mu_{ref}$ to be $\mu_1$. From the Cauchy-Schwarz inequality, we have
$$\sqrt{\mathbb{E}_{\tau^0 \sim \pi_{tar},\, \tau^1 \sim \mu_1}\big[|r^*(\tau^0) - r^*(\tau^1) - r(\tau^0) + r(\tau^1)|^2\big]} \;\geq\; \mathbb{E}_{\tau^0 \sim \pi_{tar},\, \tau^1 \sim \mu_1}\big[|r^*(\tau^0) - r^*(\tau^1) - r(\tau^0) + r(\tau^1)|\big].$$

This implies that
$$C_r^2(\mathcal{G}_r, \pi_{tar}, \mu_1) \;\leq\; \sup_{r} \frac{\mathbb{E}_{\tau^0 \sim \pi_{tar},\, \tau^1 \sim \mu_1}\big[|r^*(\tau^0) - r^*(\tau^1) - r(\tau^0) + r(\tau^1)|^2\big]}{\mathbb{E}_{\tau^0 \sim \mu_0,\, \tau^1 \sim \mu_1}\big[|r^*(\tau^0) - r^*(\tau^1) - r(\tau^0) + r(\tau^1)|^2\big]}.$$

Since $\tau^0$ and $\tau^1$ are independent, we have
$$C_r^2(\mathcal{G}_r, \pi_{tar}, \mu_1) \;\leq\; \sup_{r} \max_{\tau^0, \tau^1} \frac{d^{\pi_{tar}}(\tau^0)\,\mu_1(\tau^1)}{\mu_0(\tau^0)\,\mu_1(\tau^1)} \;=\; \max_{\tau} \frac{d^{\pi_{tar}}(\tau)}{\mu_0(\tau)}.$$

  • $\pi$, $\mu$ and $d$ in Questions

We are sorry for the unclearness. For $\pi$, we use $\tau \sim \pi$ to denote the distribution of $\tau$ when executing $\pi$ in the MDP. When $\tau$ follows a more general distribution $\mu$, i.e., when the distribution may not be induced by a policy $\pi$, we write $\tau \sim \mu$. For $d$, we only use $d^{\pi}(\tau)$ to denote the probability of $\tau$ under policy $\pi$. We will further clarify these in the final version.

[1] Rigter, M., Lacerda, B., and Hawes, N. (2022). RAMBO-RL: Robust adversarial model-based offline reinforcement learning. NeurIPS 2022.

[2] Xie, T., Cheng, C. A., Jiang, N., Mineiro, P., & Agarwal, A. (2021). Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 34, 6683-6694.

[3] Cheng, C. A., Xie, T., Jiang, N., & Agarwal, A. (2022, June). Adversarially trained actor critic for offline reinforcement learning. In International Conference on Machine Learning (pp. 3852-3878). PMLR.

[4] Uehara, M. and Sun, W. (2021). Pessimistic model-based offline RL: PAC bounds and posterior sampling under partial coverage. ICLR 2022.

[5] Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint (Vol. 48). Cambridge university press.

[6] Agarwal, A., Jiang, N., Kakade, S. M., & Sun, W. (2019). Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep, 32

Comment

Thank you for the valuable feedback. Here are our responses:

  • Implementation Details and Empirical Experiments in Weaknesses

We would like to clarify that this is mainly a theoretical work that strives to find sample-efficient offline RLHF algorithms with provable guarantees. That said, we would like to point out that we do discuss the practical implementation of our algorithm in Remark 1 in the main text and in Appendix B. In summary, we use a Lagrange multiplier to convert our algorithm into a maximin optimization problem and demonstrate how to compute the gradient of the objective function, so that we can use gradient descent-ascent to solve it. Given that this type of gradient descent-ascent has shown superior performance in standard offline RL problems [1], we believe that the above approach can also work well in practice in our preference-based RL setting. We will emphasize this part more in the final version.
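As a toy illustration of the gradient descent-ascent scheme mentioned here (a sketch only, not the implementation from Appendix B): the reward player descends on a Lagrangian-penalized objective while the policy player ascends. The linear reward parameterization, the small set of candidate policies, and the $\ell_2$ penalty standing in for the likelihood-based confidence constraint are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_hat = rng.normal(size=d)   # stand-in for the MLE reward parameter
phi = rng.normal(size=(3, d))    # stand-in expected trajectory features of 3 candidate policies
lam = 1.0                        # Lagrange multiplier weighting the confidence-set penalty

w = np.ones(3) / 3               # policy player: mixture weights over the candidates
theta = theta_hat.copy()         # reward player: adversarial reward parameter

def lagrangian(w, theta):
    # Policy value under reward theta plus a penalty pulling theta back toward theta_hat.
    return w @ (phi @ theta) + lam * np.sum((theta - theta_hat) ** 2)

for t in range(500):
    eta = 0.1 / np.sqrt(t + 1)
    # Reward player: gradient descent on theta (seeks the most pessimistic reward).
    grad_theta = phi.T @ w + 2.0 * lam * (theta - theta_hat)
    theta -= eta * grad_theta
    # Policy player: gradient ascent on w via exponentiated weights, then renormalize to the simplex.
    grad_w = phi @ theta
    w = w * np.exp(eta * grad_w)
    w /= w.sum()

print("mixture weights:", np.round(w, 3), "| objective:", round(lagrangian(w, theta), 3))
```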

  • Response to the claim: "There are no experiments either. So there is no way to check how this method empirically scales."

We would like to mention that there are many practical and impactful works on offline RL in top machine learning conferences without any experiments. For example, in the literature on offline model-free RL, [2] proposed the first provably efficient algorithm with general function approximation. While that paper does not have any experiments, it was later implemented in a follow-up paper [3], which reports that the proposed algorithm achieves SOTA and beats many offline RL baselines, such as CQL, COMBO, IQL, and TD3+BC.

Similarly, in the literature on model-based offline RL, [4] proposed the first provably efficient algorithm with general function approximation. While that paper again does not include experiments, it was later implemented in a follow-up paper [1], which reports that it achieves SOTA and beats many baselines, such as MOReL, CQL, IQL, and TD3+BC. In this way, even without experiments, theoretically backed algorithms have been empirically impactful.

  • Definition of $g(\cdot\mid\tau_1, \tau_2)$ in Weaknesses

We are sorry about the unclearness of the definition of $g$. $g$ is a function mapping from $\mathcal{T} \times \mathcal{T}$ (recall that $\mathcal{T} = (\mathcal{S} \times \mathcal{A})^H$ is the set of all possible trajectories) to $\mathbb{R}^2$, and thus $g(\cdot\mid\tau_1, \tau_2)$ is a two-dimensional vector. It does not need to be a distribution, i.e., $g(\cdot\mid\tau_1, \tau_2)$ is not necessarily positive and $\Vert g(\cdot\mid\tau_1, \tau_2)\Vert_1$ does not need to be 1. Similar functions are elaborated in the definitions in Appendix E. We will revise this in the final version.

  • Proposition 1 in Weaknesses

We would like to state that our bounds are not weak compared to the standard RL literature. First, as you notice, the log bracketing number increases logarithmically as the accuracy $\epsilon$ decreases. However, as we will later see in Theorem 1, we simply set $\epsilon = 1/N$, which results in an additional $\log(N)$ term. This term is negligible compared to the leading term $1/N$. Second, the logarithmic dependence on $B$ and $R$ is very standard in the statistical machine learning literature in general, as seen in Sections 4-5 of [5] and Definition 8.1 in a standard RL textbook [6].

  • Sample Bounds in Weaknesses

We are sorry for the misleading expression. Here, we take $N$ so that the right-hand side of (2) in Theorem 1 is less than $\epsilon$. So this expression means that if $N$ satisfies (2), we can obtain an $\epsilon$-optimal policy. To remove $N$ from the right-hand side, with some algebra, we can state that as long as $N = O(\log(1/\epsilon)/\epsilon^2)$ (focusing only on the dependence on $\epsilon$), we can obtain an $\epsilon$-optimal policy. We will clarify this in the final version.
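For concreteness, one hedged way to make that algebra explicit, assuming (purely for illustration) that the right-hand side of (2) has the generic form $c\sqrt{\log(cN)/N}$ for some problem-dependent constant $c$:

```latex
% Illustrative inversion; the constant c and the exact form of the bound are assumptions.
c\sqrt{\frac{\log(cN)}{N}} \le \epsilon
\;\Longleftrightarrow\;
N \ge \frac{c^{2}\log(cN)}{\epsilon^{2}},
% which is satisfied by N = C \log(1/\epsilon)/\epsilon^{2} for a large enough constant
% C = C(c) and small enough \epsilon, since then \log(cN) = O(\log(1/\epsilon)); hence
% N = O(\log(1/\epsilon)/\epsilon^{2}) samples suffice for an \epsilon-optimal policy.
```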

  • $r$ and $r^*$ in Questions

No, $r$ and $r^*$ are reward functions. They map a trajectory $\tau$ to a scalar. Thus, $r$ (respectively $r^*$) can be viewed as a vector whose $\tau$-th entry corresponds to the value of $r(\tau)$ (respectively $r^*(\tau)$) for each $\tau \in \mathcal{T}$.

  • $C_{tr}$ in Questions

No, it does not depend on $\mu_1$, because when we reduce $C_r$ to $C_{tr}$, we take $\mu_{ref} = \mu_1$ in $C_r(\mathcal{G}_r, \pi_{tar}, \mu_{ref})$, and then $\mu_1$ is canceled out by $\mu_{ref}$.

Comment

Dear reviewer,

There are only three days left for the discussion period. Do you have any other questions that we can help with?

Comment

Thanks a lot for your response, and for pointing out the discussion on the practical implementation. I do understand that the paper focuses on theoretical results and introduces several theoretical insights that are valuable for the community. On the other hand, I feel it is also important for such works to introduce a practical version of their algorithms and provide some empirical evaluation to support the theoretical claims. For instance, since you propose an algorithm with better sample efficiency, you could have designed a very simple experiment showing that the algorithm is more sample efficient than previous methods.

Nevertheless, I believe that the theoretical insights introduced in the paper are relevant to the community and have increased my score.

Comment

Thank you very much for increasing the score! We appreciate your feedback and will try adding some experiments in the future.

Review
8

Contributions

  • This paper proposes the first algorithm for offline trajectory-wise PbRL with general function approximation and under partial coverage.
    • The reward function can be defined over the whole trajectory, or for each state-action pair.
    • Extensions cover the case where the transition kernel is also learned, and the case where the data consist of action comparisons rather than trajectory comparisons.
  • Setting: finite horizon MDP with trajectory-wise reward.
  • Preference model: known link function.
  • Goal: ϵ-δ PAC offline RL: use offline dataset {(traj 1, traj 2, preference signal)} to learn the optimal policy, with unknown reward (extended to also unknown transition kernels).
  • Key innovation: function approximation
    • Use a given realizable function class Gr to learn the reward function r.
    • Measure the complexity as the ϵ-bracketing number of the set of preferences induced by Gr.
  • If the transition kernel is known:
    • Algorithm
      • (1) Estimate the reward $\hat r$ by MLE using the data; (2) construct a confidence set around $\hat r$; (3) perform distributionally robust planning to jointly maximize the total reward and stay close to some reference trajectory, under the worst-case reward in the confidence set.
    • Analysis
      • This paper develops a concentrability coefficient for preference-based feedback (Sec. 4.2) and discusses the relation between the per-trajectory concentrability coefficient and the per-step concentrability coefficient (Sec. 4.3). This result indicates that trajectory-wise feedback is intrinsically harder than step-wise feedback in offline PbRL.
      • The resulting PAC bound (under Eq.2) depends on this coefficient and also the ϵ-bracketing number of the function class.
  • If the transition kernel is unknown:
    • Algorithm: also do MLE and confidence set for transition kernel.
    • Analysis: concentrability coefficient.
  • If action-based comparison:
    • Assuming that the preference feedback is based on the Q-values at the state for the two compared actions.
    • Algorithm: MLE the advantage function and output the greedy policy.
    • Analysis: concentrability coefficient.

Strengths

  • The algorithm and analysis are sound.
  • Considers multiple cases, e.g., unknown transitions and action-based comparisons, and fits them within a similar algorithmic framework.
  • Nice comparison with previous work in Sec. 4.1 and Remarks 2 and 4.

Weaknesses

  • This paper focuses entirely on theory. It would be nice to also have some empirical results.

Questions

  • It seems to me that the link function in Eq.1 is known? Would it work if the link function is unknown?
  • For unknown transition kernels, would the result extend to the case where the transition kernels' data are sampled separately from the preference data?

Comment

Thank you very much for the positive feedback! Here are our responses:

  • Empirical Results

We are currently considering adding an empirical case study in future work. That said, we do discuss the practical implementation of our algorithm in Appendix B. In summary, we use a Lagrange multiplier to convert our algorithm into a maximin optimization problem and demonstrate how to compute the gradient of the objective function, so that we can use gradient descent-ascent to solve it. Given that gradient descent-ascent has shown superior performance in standard offline RL problems [1], we believe that the above approach can also work well in practice.

  • Link Function

Yes, we assume the link function is known. When the link function is unknown, we may consider a link function class $\mathcal{F}$. Then we can generalize our analysis by implementing MLE over the joint function class $(\mathcal{F}, \mathcal{G}_r)$, i.e., $(\hat{r}, \hat{\Phi}) = \arg\max_{r \in \mathcal{G}_r, \Phi \in \mathcal{F}} \sum_{n=1}^N \log P_{r,\Phi}(o = o^n \mid \tau^{n,1}, \tau^{n,0})$, where $P_{r,\Phi}(o = 1 \mid \tau^1, \tau^0) := \Phi(r(\tau^1) - r(\tau^0))$. We leave its formal analysis to future work.

  • Unknown Transition Kernel

Yes, the analysis still holds. We reuse the dataset in order to reduce sample complexity, but you can definitely collect a different dataset for transition learning as long as the dataset has a bounded single-policy concentrability coefficient (Definition 3).

[1] Rigter, M., Lacerda, B., and Hawes, N. (2022). RAMBO-RL: Robust adversarial model-based offline reinforcement learning. NeurIPS 2022.

Comment

Thank you for your comment!

Comment

You are welcome!

Review
8

This paper analyzes preference-based reinforcement learning in the offline setting. Specifically, the authors propose three algorithms for offline PbRL. The first estimates the reward function using MLE, constructs confidence sets around the MLE, and optimizes the policy against the worst reward model, under the assumption that the transition probabilities are known. The second algorithm is similar to the first except that it operates under the assumption that the transition probabilities are unknown: it estimates the transition probabilities using MLE, constructs uncertainty sets around the MLE of the transition probabilities, and then optimizes the policy against the worst reward model and transition probabilities in their respective uncertainty sets. The third algorithm considers the case where preferences are established over actions instead of trajectories; it estimates the advantage function and computes a greedy policy based on it. While the paper does not empirically evaluate these algorithms, it theoretically analyzes their sample complexity and performance error and shows that as long as the offline data covers the target policy, the learned policy can compete with the target policy. The authors also show that even if the reward is trajectory-wise, one can still efficiently learn the policy if the transition dynamics are estimated per step. Finally, the paper establishes a partial coverage guarantee for the third algorithm and shows that the sample complexity scales with a bound on the advantage function.

Strengths

The paper is clearly written. It provides several strong and novel theoretical results on preference-based RL in the offline setting. Some of these results also generalize the results of Zhu et al. from linear function approximation to general function approximation.

Weaknesses

Although the main contribution is theoretical, it would be nice to see an empirical case study or two where the given algorithm works as expected.

Questions

No questions.

Comment

Thank you very much for your positive feedback! We are currently considering adding an empirical case study in future work. That said, we do discuss the practical implementation of our algorithm in Appendix B. In summary, we use a Lagrange multiplier to convert our algorithm into a maximin optimization problem and demonstrate how to compute the gradient of the objective function, so that we can use gradient descent-ascent to solve it. Given that gradient descent-ascent has shown superior performance in standard offline RL problems [1], we believe that the above approach can also work well in practice.

[1] Rigter, M., Lacerda, B., and Hawes, N. (2022). RAMBO-RL: Robust adversarial model-based offline reinforcement learning. arXiv:2204.12581.

Comment

Thank you for the comment!

Comment

You are welcome!

AC Meta-Review

The authors address the problem of offline preference-based reinforcement learning (PbRL) with human feedback. Based on feedback in the form of preferences between trajectory pairs, they estimate the latent reward and then essentially formalise the learning task as a planning problem. For this setting, they provide interesting theoretical results and formal guarantees.

Although the reviewers raise a number of critical points in their original reports, there is agreement that the paper has great potential and makes a significant contribution. The authors showed a high level of commitment during the rebuttal phase and did their best to respond to the comments and to improve the submission. This was appreciated and positively acknowledged by all. In the discussion between authors and reviewers, some critical points could be resolved and some questions clarified. Although practical results and experiments would have further strengthened the paper, there is agreement that the paper has enough merit to be accepted.

Why Not a Higher Score

Strong paper, but there are even better ones.

Why Not a Lower Score

Strong enough paper.

Final Decision

Accept (spotlight)