Fewer Questions, Better Answers: Efficient Offline Preference-based Reinforcement Learning via In-Dataset Exploration
An effective query method for offline preference-based RL.
Abstract
Reviews and Discussion
This paper investigates offline preference-based reinforcement learning with a focus on two challenges: inefficient exploration and overoptimization of reward models. Specifically, this paper proposes OPRIDE, which consists of (1) an exploration strategy to select queries, and (2) a discount scheduling mechanism to avoid overoptimization. Some theoretical insights on the exploration strategy are presented. Extensive experiments are conducted to demonstrate the effectiveness of OPRIDE under an extremely constrained query budget of 10.
Strengths
- The techniques seem simple yet effective under an extremely constrained query budget of 10.
- OPRIDE is evaluated on MetaWorld, AntMaze, and MuJoCo tasks.
Weaknesses
- Could you demonstrate how to infer the better query efficiency of OPRIDE over random queries from Theorem 4?
- The design of the discount scheduling seems to be trivial. Are there any empirical or theoretical insights to support its motivation?
- Could you provide a hyper-parameter analysis on the ensemble number?
- Why does this paper limit the query budget to 10 (although it is varied over 1~20 in Fig. 2)? I think this budget is far too constrained considering practical scenarios, especially for LLMs. Could you please provide some empirical results with a larger query budget, such as 200? Will the performance of OPRIDE deteriorate in this case?
- In terms of the query efficiency discussed in Fig. 2, does PT refer to random query? Could you please provide results of OPRIDE w/ random query?
Questions
See the weaknesses above.
Dear Reviewer:
Thanks for finding our method simple yet effective. The main issues regarding experiments and theory are addressed as follows.
W1: Could you demonstrate how to infer better query efficiency of OPRIDE than random query from Theorem 4?
A for W1: Using a proper query selection criterion is crucial for reducing the preference error in Theorem 4. Similar to exploration in bandits, random exploration does not lead to efficient identification of the optimal action and incurs large regret. Using random queries leads to a much larger reward function estimation error (the second term in Equation (15)) and is therefore less efficient than OPRIDE.
W2: The motivation of the discount scheduling.
A for W2: Intuitively, a learned reward function is prone to overoptimization [1, 2], leading to overestimation of a policy's return; therefore, a pessimistic estimate is required to avoid such exploitation of the imperfect reward function. For this reason, we propose to reduce the discount factor based on the variance of the value function estimates as a pessimistic mechanism, which is more effective than previous ensemble-based methods like REDQ. Without any pessimistic estimation, we would need many more queries before the estimated reward function becomes good enough to resist such overoptimization.
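For concreteness, a minimal sketch of how such a variance-based discount schedule could look is given below. The function names and the exact mapping from ensemble disagreement to the discount factor are illustrative assumptions, not the rule used in the paper.

```python
import numpy as np

def scheduled_discount(q_ensemble, states, actions,
                       gamma_max=0.99, gamma_min=0.9):
    """Map the disagreement of an ensemble of Q-estimates to a discount factor.

    q_ensemble: list of callables, each returning Q(s, a) for a batch.
    The larger the ensemble variance (i.e., the less we trust the learned
    reward/value), the smaller the discount factor, acting as a pessimistic
    regularizer against reward overoptimization.
    """
    # Stack ensemble predictions: shape (M, batch)
    q_values = np.stack([q(states, actions) for q in q_ensemble], axis=0)
    # Average per-sample variance across ensemble members
    disagreement = q_values.var(axis=0).mean()
    # Squash disagreement into (0, 1]; the scaling is an illustrative choice
    trust = 1.0 / (1.0 + disagreement)
    return gamma_min + (gamma_max - gamma_min) * trust
```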
We conduct additional ablation studies for the discount scheduling (IDE). The experimental results in Table 1 show that the discount scheduling is necessary for OPRIDE.
| Task | OPRIDE | OPRIDE w/o IDE |
|---|---|---|
| bin-picking | 93.3±3.2 | 71.9±9.0 |
| button-press-wall | 77.7±0.1 | 77.2±0.8 |
| door-close | 94.8±1.1 | 72.3±0.1 |
| faucet-close | 73.1±0.8 | 59.4±8.5 |
| peg-insert-side | 79.0±0.2 | 13.8±4.4 |
| reach | 88.0±0.5 | 83.3±0.1 |
| sweep | 78.5±1.0 | 28.7±1.8 |
Table 1. Ablation study for OPRIDE.
W3: Hyper-parameter analysis on the ensemble number.
A for W3: As suggested, we conduct additional ablation studies on the ensemble number M. The experimental results in Table 2 show that the performance of OPRIDE increases with the ensemble number, since a larger ensemble provides more diverse information. In addition, we find that OPRIDE already performs sufficiently well with a small ensemble (around M = 2, the default used in our reported results).
| Task | M = 2 | M = 5 | M = 10 |
|---|---|---|---|
| bin-picking | 93.3±3.2 | 94.1±2.4 | 94.9±2.6 |
| button-press-wall | 77.7±0.1 | 78.1±0.7 | 80.8±0.9 |
| door-close | 94.8±1.1 | 95.6±1.7 | 96.1±1.1 |
| faucet-close | 73.1±0.8 | 75.4±1.1 | 78.5±0.9 |
| peg-insert-side | 79.0±0.2 | 80.9±0.2 | 81.8±0.3 |
| reach | 88.0±0.5 | 89.5±0.9 | 90.0±1.2 |
| sweep | 78.5±1.0 | 79.9±1.7 | 80.3±1.4 |
Table 2. Hyper-parameter analysis on the ensemble number M.
W4: Why does this paper limit the query budget to 10, and could you please provide some empirical results with a larger query budget such as 200?
A for W4: This is because, in our experiments, we find that 10 queries are sufficient to achieve good performance on the evaluated tasks. Furthermore, we conduct experiments with larger query budgets. The experimental results in Table 3 show that the performance of OPRIDE improves as the query budget increases, and that the performance is already sufficiently good with a budget of 10.
| Task \ Query Budget | 10 | 200 | 300 |
|---|---|---|---|
| bin-picking | 93.3±3.2 | 93.9±2.4 | 94.9±1.2 |
| button-press-wall | 77.7±0.1 | 79.1±0.7 | 80.8±0.9 |
| door-close | 94.8±1.1 | 95.6±0.7 | 96.5±1.1 |
| faucet-close | 73.1±0.8 | 75.4±2.1 | 78.5±1.9 |
| peg-insert-side | 79.0±0.2 | 80.4±0.4 | 86.1±0.3 |
| reach | 88.0±0.5 | 88.5±0.9 | 89.0±0.8 |
| sweep | 78.5±1.0 | 81.9±0.7 | 85.8±0.4 |
Table 3. Additional experiments with more query budgets.
W5: Does PT refer to random query, and could you please provide results of OPRIDE w/ random query?
A for W5: Yes, PT refers to random query selection. As suggested, we conduct experiments by replacing the query selection module in OPRIDE with random queries (named OPRIDE (Random)). The experimental results in Table 4 show that the in-dataset exploration query mechanism is necessary for OPRIDE.
| Task | OPRIDE | OPRIDE (Random) |
|---|---|---|
| bin-picking | 93.3±3.2 | 70.9±9.4 |
| button-press-wall | 77.7±0.1 | 68.1±9.7 |
| door-close | 94.8±1.1 | 87.6±0.7 |
| faucet-close | 73.1±0.8 | 57.4±12.1 |
| peg-insert-side | 79.0±0.2 | 74.9±4.2 |
| reach | 88.0±0.5 | 86.5±1.9 |
| sweep | 78.5±1.0 | 69.9±1.7 |
Table 4. Ablation study for OPRIDE.
Thanks again for the valuable comments. We hope our response has addressed your concerns. We look forward to further discussion.
[1] Gao, L., Schulman, J., and Hilton, J. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp. 10835–10866. PMLR, 2023.
[2] Zhu, B., Jordan, M. I., and Jiao, J. Iterative data smoothing: Mitigating reward overfitting and overoptimization in RLHF. arXiv preprint arXiv:2401.16335, 2024.
Thanks for your detailed response. I still have two follow-up questions:
For W1, the magnitude of the reward function estimation error should be derived for both the random query strategy and the proposed query strategy in the theoretical analysis. Otherwise, the theorem cannot fully demonstrate the advantages of your proposed algorithm.
For W4, increasing the budget would definitely lead to better performance. My comment was that the performance difference w.r.t. the random query strategy would deteriorate under a higher query budget. So could you provide the performance of the random query strategy with different query budgets?
Dear Reviewer,
Thanks for the quick reply. We will address your follow-up questions below.
Q1: Theoretical comparison between random query and the proposed query strategy.
A for Q1: Following the analysis of fitted Q-iteration [1], when using random queries drawn from a distribution , and keeping the other parts of our algorithm the same, the suboptimality bound will have the following form:
where . It is easy to show that . So, using random queries leads to a worse bound by a factor of on the reward error. Specifically, becomes unbounded with an infinite state space, while can remain bounded beyond linear function approximations. Intuitively, when we have a large hypothesis space for the value function, using random queries can be hopeless, while there may still be structures that allow strategic queries to be efficient.
Q2: The performance difference w.r.t. the random query strategy.
A for Q2: As suggested, we conduct additional experiments with the random query strategy under various query budgets. The experimental results in Table 1 show that as the query budget increases, the performance of both Random and OPRIDE gradually improves, and the gap between them decreases progressively. This is because when the query budget is sufficiently large, OPRIDE's performance approaches that under the true reward. On the other hand, matching the performance OPRIDE achieves with a small number of queries requires many more queries under the Random strategy, which demonstrates the query efficiency of our method.
| Task \ Query Budget | 10 (OPRIDE) | 10 (Random) | 200 (OPRIDE) | 200 (Random) | 300 (OPRIDE) | 300 (Random) |
|---|---|---|---|---|---|---|
| bin-picking | 93.3±3.2 | 31.9±16.2 | 93.9±2.4 | 54.3±3.9 | 94.9±1.2 | 70.1±2.9 |
| button-press-wall | 77.7±0.1 | 58.8±0.9 | 79.1±0.7 | 64.9±1.4 | 80.8±0.9 | 80.2±0.8 |
| door-close | 94.8±1.1 | 65.1±10.1 | 95.6±0.7 | 74.2±3.3 | 96.5±1.1 | 82.3±1.2 |
| faucet-close | 73.1±0.8 | 57.8±0.9 | 75.4±2.1 | 64.2±2.2 | 78.5±1.9 | 69.4±5.8 |
| peg-insert-side | 79.0±0.2 | 16.8±0.1 | 80.4±0.4 | 80.2±0.4 | 86.1±0.3 | 83.8±0.4 |
| reach | 88.0±0.5 | 82.0±0.8 | 88.5±0.9 | 84.3±0.9 | 89.0±0.8 | 85.3±0.7 |
| sweep | 78.5±1.0 | 8.0±0.4 | 81.9±0.7 | 68.1±2.1 | 85.8±0.4 | 80.7±2.8 |
Table 1. Comparison with the random query strategy under various query budgets.
Best,
The Authors
References
[1] Munos, Rémi, and Csaba Szepesvári. "Finite-Time Bounds for Fitted Value Iteration." Journal of Machine Learning Research 9.5 (2008).
Thanks for your reply. I think the comparison with the random query strategy w.r.t. different query budgets should be included in the manuscript to better demonstrate the effectiveness of the proposed algorithm. I decide to raise my score by 1.
We would like to thank the reviewer for raising the score to 6! We also appreciate the valuable comments, which have helped us significantly strengthen the paper.
The paper presents several contributions in the context of offline preference-based reinforcement learning:
- a query selection technique to reduce the number of queries asked to a human
- a dynamic discounting approach to account for higher uncertainty of returns
- a theoretical analysis to prove the efficiency of the proposed approach
Strengths
The paper presents two new techniques for preference-based reinforcement learning: one for selecting queries and one for handling the uncertainty of value estimation. I believe that both could also be useful in the online reinforcement learning setting. The authors may want to comment on that point in the paper.
The paper also provides a theoretical analysis assuming that queries can be generated online, although policy training is still performed in the offline reinforcement learning setting.
Weaknesses
I believe that the related work should mention reinforcement learning from human feedback (RLHF) since the term "RLHF" has recently become more popular than the term "preference-based reinforcement learning".
The formalization is a bit strange, may be overly complex, and may contain some errors, e.g.:
- In the presentation of MDPs, why is it important to use the non-stationary definition?
- Why do we need both Bellman operators?
- Why are the returns in (6) not discounted?
- What's the impact of learning a return model vs. learning a reward model? Also, do you learn a return function instead of a reward function in the experiments? If return models are used, why not use "return network" in the whole paper, and notably in Algorithm 1, to make things less confusing?
- On the comment about the query selection criterion, the authors state that their proposition is scale sensitive. Isn't it also the case for the variance-based criterion?
The theoretical analysis considers a different algorithm compared to Algorithm 1. In addition, the setting for the theoretical analysis is a bit strange to me. The authors assume that online interaction is possible for querying, but not for policy learning. The authors should discuss this point in more detail. Does the theoretical analysis really justify their proposed criterion in Algorithm 1?
Minor issues:
- Citations between parentheses should not be part of a sentence.
- With out -> Without
- The discount factor is missing in (2)
- Line 146: the optimal policy -> an optimal policy (since it may not be unique)
- Definition 2: please recall the definition of \alpha'-independent
- Line 202: Shall R be \theta?
- Line 252: a space is missing before Intuitively
- Line 265: r_\theta_1 -> r_\theta_i
- Line 272: imitaition -> imitation
- Line 329: \hat q_1 -> \hat q_R
Questions
Here are my main questions:
- What's the impact of learning a return model vs. learning a reward model? Also, do you learn a return function instead of a reward function in the experiments? If return models are used, why not use "return network" in the whole paper, and notably in Algorithm 1, to make things less confusing?
- On the comment about the query selection criterion, the authors state that their proposition is scale sensitive. Isn't it also the case for the variance-based criterion?
- Does the theoretical analysis really justify the proposed criterion in Algorithm 1?
Dear Reviewer,
Thanks for your valuable comments. We hope the following statement can address your concern.
W1: Discussion of RLHF in the related work section.
A for W1: Thanks for your suggestion; we have added a discussion of RLHF to the related work section in the updated version.
W2, Q1 and Q2: Clarification of formalization.
- Why is it important to use the non-stationary definition?
We are not sure what exactly the reviewer means by "non-stationary" and assume that "non-stationary" means "heterogeneous", that is, the transition function and the reward function can be different at each timestep. This is not crucial for the theoretical analysis, but it makes the analysis more general. That is, homogeneous MDPs are a special case of our analysis.
- Why do we need both Bellman operators? The Bellman evaluation operator is used to simplify our theoretical analysis; please refer to the proof of Lemma 5 for more details.
- Why are the returns in (6) not discounted? The return is not discounted because we consider the finite-horizon case in the theoretical analysis.
- What's the impact of learning a return model vs. a reward model? A return model, or more accurately a value model, is crucial for selecting informative queries. Most prior works select informative queries by maximizing disagreement on the reward function, while we select queries by maximizing disagreement on the value function. Intuitively, disagreement over the reward function in low-return regions does not affect the judgement of optimal policies, so it can be less efficient compared with using disagreement over value functions.
- Do you learn a return function instead of a reward function in the experiments? We learn a value function on top of the learned reward function in the experiments and use the value function as the query selection criterion (a hedged sketch of the standard reward-learning step is given after this list).
- Variance-based query selection is also scale-sensitive. We agree that variance-based query selection is also scale-sensitive. In that context, we are comparing with the disagreement-based method [1], which only considers the number of preferences each candidate gets and is therefore scale-insensitive.
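To illustrate the reward-then-value pipeline referred to above, here is a minimal sketch of the standard Bradley-Terry preference loss commonly used in PbRL; the names (`preference_loss`, `reward_model`, `traj_a`, `traj_b`) are illustrative assumptions, and this is not the paper's exact training code. The value function used for query selection would then be obtained by running an offline RL learner on the rewards predicted by such a model.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, traj_a, traj_b, label):
    """Bradley-Terry loss for learning a reward model from one pairwise
    trajectory preference (label = 1 if traj_a is preferred, else 0).

    traj_a, traj_b: tensors of shape (T, obs_dim + act_dim); reward_model maps
    each (state, action) pair to a scalar reward.
    """
    ret_a = reward_model(traj_a).sum()  # predicted segment return of traj_a
    ret_b = reward_model(traj_b).sum()  # predicted segment return of traj_b
    # P(traj_a preferred) = exp(ret_a) / (exp(ret_a) + exp(ret_b))
    logits = torch.stack([ret_a, ret_b]).unsqueeze(0)      # shape (1, 2)
    target = torch.tensor([1 - label], dtype=torch.long)   # 0 <-> traj_a wins
    return F.cross_entropy(logits, target)
```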
W3 and Q3: Does the theoretical analysis really justify the proposed criterion in Algorithm 1?
A for W3 and Q3: When the size of the dataset approaches infinity, any online trajectory can be found in the offline dataset, so that selecting queries from the dataset is equivalent to rolling out online queries. Therefore, when the dataset size is much larger than the number of queries, our theoretical analysis is a proper approximation of the pure offline setting, and the query selection principles still apply.
W4: Minor issues.
A for W4: Thanks for your detailed suggestions; we have corrected these issues in the updated version.
Thanks again for the valuable comments. We hope our response has addressed your concerns. We look forward to further discussion.
[1] Shin, Daniel, Anca Dragan, and Daniel S. Brown. "Benchmarks and Algorithms for Offline Preference-Based Reward Learning." Transactions on Machine Learning Research.
Dear Reviewer,
We have added additional explanations for our methods. We are wondering if our response and revision have cleared your concerns. We would appreciate it if you could kindly let us know whether you have any other questions. We are looking forward to comments that can further improve our current manuscript. Thanks!
Best regards,
The Authors
I'd like to thank the authors for their responses. While they clarified a few points, I still believe that the presentation is overly complicated:
- In my opinion, presenting with the non-stationary MDP model (I don't think I've seen references using "heterogeneous" to refer to "non-stationary" before) is not necessary. In general, many theoretical results can be extended to the non-stationary case, because there's not much technical difficulty in doing that. Also, all the experiments are run in the stationary case, if I'm not mistaken.
- The Bellman operator B is not really standard and seems not to be used in the main paper. I think if it's really needed, it would be better to defer its definition to the proofs. Also, I believe it can simply be defined with respect to operator T.
In addition, I suggest that the authors better explain in the paper:
- the discrepancy between using the discount factor in some definitions and the consideration of the finite horizon in some parts of the paper;
- the point about learning a reward model, a return model, or a value model (I believe that in the experiments, a value model is not learned directly, but via either a reward model or a return model);
- the point about scale-sensitivity. The updated text mentions two characteristics, but then lists three points. Also, I think the text still implicitly suggests that variance is not scale-sensitive. However, why is this property an advantage?
Finally, I still believe that there's a gap between the theoretical analysis and the real algorithm. This point is moreover still not discussed in the updated paper.
For all these reasons, I prefer to keep my current score.
Dear Reviewer,
Thank you for your detailed feedback and suggestions, which have been invaluable in improving our work. Below, we address your concerns point by point:
W1: Non-stationary MDP.
We appreciate your feedback regarding the complexity of our current formulation. To clarify, we used "heterogeneous" to distinguish our scenario from the common interpretation of "non-stationary" MDPs, which typically assume changes in dynamics and rewards across episodes [1, 2]. Our formulation instead allows variations in dynamics and rewards across timesteps within a single episode, which is standard.
In response to your suggestion, we have simplified the theoretical section by adopting a stationary, infinite-horizon formulation. This simplified version aligns better with our empirical approach and streamlines the presentation.
W2: The Bellman operator is not really standard and seems not to be used in the main paper.
We appreciate the reviewer’s observation. In response, we have moved the definition of the Bellman operator to the Appendix, where it is used only in the proofs, and have revised the main text accordingly to avoid unnecessary complexity.
W3: Discrepancy between discount factor and finite horizon considerations.
As noted in W1, we now use a stationary, infinite-horizon formulation throughout the theoretical analysis. This resolves the earlier inconsistency between the discount factor and finite-horizon considerations, ensuring coherence across the paper.
W4: Learning a reward model, return model, or value model.
The reviewer is correct that the theoretical analysis employs a return model, while the empirical methodology involves learning a reward model first and then deriving the value model. These approaches are theoretically equivalent, but the latter often provides better empirical performance. We will update the manuscript to clarify this distinction and its motivation, ensuring alignment between theoretical and empirical discussions.
W5: Scale-sensitivity.
We have clarified scale-sensitivity in the revised manuscript. This concept ensures a preference for queries with larger value disagreements, leading to stronger theoretical guarantees. For example, suppose we have two value models and two candidate queries: for the first query the disagreement between the models is 0.01, while for the second it is 1. A scale-sensitive criterion will choose the second query because it is more informative. Without scale-sensitivity, the two queries would be treated equivalently, and either could be chosen at random. This makes scale-sensitivity advantageous for learning from the true value distribution.
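The following sketch contrasts a scale-sensitive criterion with a scale-insensitive (vote-counting) one. The candidate representation and the value-model interface are hypothetical; this is an illustration of the distinction, not the paper's exact selection rule.

```python
import numpy as np

def select_query_scale_sensitive(candidates, value_models):
    """Pick the trajectory pair whose predicted value gap the ensemble
    disagrees about the most in magnitude (scale-sensitive)."""
    scores = []
    for traj_a, traj_b in candidates:
        # Predicted value gap V(traj_a) - V(traj_b) under each ensemble member
        gaps = np.array([v(traj_a) - v(traj_b) for v in value_models])
        scores.append(gaps.max() - gaps.min())  # spread of the gap estimates
    return candidates[int(np.argmax(scores))]

def select_query_vote_based(candidates, value_models):
    """Scale-insensitive alternative: only count how often the models disagree
    on which trajectory is better, ignoring by how much."""
    scores = []
    for traj_a, traj_b in candidates:
        votes = sum(v(traj_a) > v(traj_b) for v in value_models)
        scores.append(min(votes, len(value_models) - votes))  # closest to a split
    return candidates[int(np.argmax(scores))]
```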
W6: Gap between theoretical analysis and real algorithm.
We understand the reviewer’s concern regarding the gap between theoretical analysis and algorithm implementation. Our analysis assumes online interaction for querying but not for policy learning, a simplifying assumption that can be relaxed.
In cases where we use purely offline queries (instead of online queries), we must ensure that our queries are well-represented in the dataset. This introduces an additional error term on the order of . This error is negligible when (as it becomes zero, as in the online case) or when (as the error becomes much smaller than the reward error term). A more detailed analysis is available at This Link.
Thus, the theoretical framework can now align with empirical practices, justifying our query selection criteria. We will discuss this assumption in more detail in the updated manuscript, as suggested by the reviewer.
We sincerely hope this addresses your remaining concerns. We look forward to further discussion!
Best,
The Authors
References
[1] Lecarpentier, Erwan, and Emmanuel Rachelson. "Non-stationary Markov decision processes, a worst-case approach using model-based reinforcement learning." Advances in neural information processing systems 32 (2019).
[2] Besbes, Omar, Yonatan Gur, and Assaf Zeevi. "Stochastic multi-armed-bandit problem with non-stationary rewards." Advances in neural information processing
The paper proposes an active learning method for offline preference-based reinforcement learning, i.e., a method to select which trajectories from a dataset to label with pairwise preferences. The proposed method consists of a value-difference-based query selection criterion and a discount adaptation rule based on the variance of a reward model ensemble. The paper provides a theoretical result about this algorithm as well as positive empirical results compared to alternative methods.
Strengths
- The method is novel, well-motivated and a nice simplification compared to prior work.
- The discount adaptation schedule is a neat idea to reduce reward model overoptimization and it seems to work well in practice.
- The empirical evaluation is thorough, looks at multiple environments and includes most of the important ablations and experiments.
- Overall, the empirical results seem strong over a variety of environments, making this a promising method.
- The paper is well-written and easy to read.
Weaknesses
- The paper does not make the key contributions clear enough
- The introduction focuses on the offline version of the PbRL problem as being the main novelty. However, prior work very commonly considers an offline setting or can be readily adapted. So, I don't think this is the important part of the contribution here.
- Instead, I think the novel algorithm, theoretical result, and empirical performance should be highlighted more. In particular, the variance-based discount scheduling is novel (at least to me) and could be highlighted more.
- The related work section should explain the differences to prior work more
- How is the method different from disagreement based method?
- How is the setting different from the discussed works on semi-supervised offline RL?
- How do the theoretical results compare to prior theoretical work?
- Comparison to prior methods is lacking a bit, especially in the method section. The objective in equation (8) seems to have strong connections to prior work. In particular, Lindner et al. consider the variance over the value difference between two trajectories. The present paper considers two reward functions that maximize the value difference between trajectories. This seems very related to estimating the variance, and I think it would warrant some more discussion of the differences.
- The discussion of the theoretical results needs to be improved
- The main paper should give an intuition for the proof technique or a proof sketch and analyze the result more. As a reader, I'd like to know which parts of the algorithm are load-bearing and which components in the final bound are due to which part of the algorithm and the analysis. Currently, I cannot learn about this without reading the full proof in the appendix.
- Currently the paper is lacking any comparison of the theoretical results to prior work.
- IIUC the variance scheduling is not part of the theoretical variant of the algorithm. The section in the main paper does not make this clear, and I only learned this from reading the appendix. This seems like an important point to clarify and discuss the limitations of.
- The experiments are only done in quite basic RL settings. PbRL is most used in LLM settings in practice, and the paper doesn't provide any evidence for the method's advantages carrying over to that setting.
- The choice of baselines and ablations has a few gaps:
- Why not run IDRL in Antmaze or Figure 2?
- Why does Table 6 in the Appendix not have IDRL results?
- What about no variance scheduling (and no PDS)? It would be interesting to see how much worse that performs.
- The proposed method has practical downsides that are not sufficiently discussed in the paper
- A common practical problem of this kind of method is having to train ensembles of reward models and policies which can get expensive in time and computational cost. The algorithm proposed here requires training M reward, value, and Q-models, which can be a limitation.
- The experiments are done in a quite idealised setting. For example, IIUC the data is collected from a perturbed expert policy. In practice, it would be important to understand how well the algorithm can deal with less ideal data and other conditions.
Questions
General questions
- The focus on a return model instead of a reward model makes the setting very similar to dueling bandit problems. How does the proposed method compare to typical approaches there, and could dueling bandit algorithms be adapted to serve as baselines to compare your approach to?
- How does the theoretical bound compare to results in prior work and typical results (e.g., in multi-armed bandits)? Is it likely that the bound can be improved with more careful analysis? Is it possible to say something about a lower bound on the regret?
- Will code for the experiments be released to enable others to reproduce the work?
Questions about experiments
- Figure 2: How many datapoints are the boxplots computed over?
- Why not run IDRL in Antmaze or Figure 2?
- Why does Table 6 in the Appendix not have IDRL results?
- Could you use dueling bandit algorithms as a baseline?
Overall assessment: In the current status, I think this is a borderline paper but I'm tending towards accepting it. If my questions and concerns can be addressed during the discussion phase, I'll happily increase my score.
Dear Reviewer,
Thanks for finding our paper novel, well-motivated, a nice simplification, and for appreciating the thorough experimental results. We hope the following statements address your concerns.
W1: Key contributions are not clear enough.
A for W1: Thanks for your suggestion; we have revised the introduction accordingly in the updated version.
W2: Differences to prior work in the related work section.
A for W2: Thanks for your suggestion; we have explained the differences to prior work in more detail in the related work section of the updated version.
W3: Comparison to prior methods.
A for W3: While our method has connections to prior works (e.g., IDRL [1]), our approach introduces several novel contributions:
- Unlike IDRL, which relies on the Laplace approximation and the Hessian matrix for posterior computation, our method leverages critic values for query selection, ensuring easier implementation and superior empirical performance, as demonstrated in our comparative experiments in the paper.
- Additionally, unlike IDRL, our method provides a concrete suboptimality bound and regret analysis, enhancing its theoretical robustness.
Thanks for your suggestion, we have added more discussion of the differences in the updated version.
W4 and Q2: Discussion of the theoretical results.
- Intuition for the proof technique or a proof sketch: We thank the reviewer for the suggestion. The proof consists of two main steps. The first step decomposes the suboptimality into two parts: the dynamics error and the reward function error (see the schematic decomposition after this list). This is only possible when we use a pessimistic reward function (i.e., Equation (9) and the confidence set selection in Equation (12)). The dynamics error can be well bounded when using a proper offline algorithm. We are interested in the reward function error due to limited preference data. The second step aims to bound this error with the explorative query selection criterion (i.e., Equations (8) and (14)). Without such a criterion, the reward function error can be unbounded.
- Comparison of the theoretical results: Our result gives the first finite sample complexity bound for PbRL with an offline dataset and online queries under general function approximation. Compared to the pure online setting [2, 3], our result applies to the offline-dataset-with-online-queries setting; compared to the linear setting [2], our result extends to general function approximation with finite Eluder dimension. Compared to [1], our analysis gives a rigorous complexity bound.
- Theoretical roots for variance scheduling: We thank the reviewer for raising this point. While the variance scheduling is an empirical approximation of the pessimism principle in Equation (12), there is theoretical evidence that they are equivalent in terms of regularization effect [4] and model-based pessimism principles [5].
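To make the first step more concrete, here is a schematic version of a standard suboptimality decomposition. The notation ($V^{\pi}_{r}$ for the value of policy $\pi$ under reward $r$, $\hat r$ for the pessimistic learned reward, $\hat\pi$ for the returned policy) is generic, and this is an illustration of the proof structure rather than the paper's exact statement.

```latex
V^{\pi^*}_{r} - V^{\hat\pi}_{r}
= \underbrace{\big(V^{\pi^*}_{r} - V^{\pi^*}_{\hat r}\big)
            + \big(V^{\hat\pi}_{\hat r} - V^{\hat\pi}_{r}\big)}_{\text{reward estimation error (bounded via the query selection criterion)}}
+ \underbrace{\big(V^{\pi^*}_{\hat r} - V^{\hat\pi}_{\hat r}\big)}_{\text{error of the offline RL step under the learned reward}}
```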
W5: Evidence on the LLM settings:
A for W5: As suggested, we have conducted experiments on the LLM setting using and the dataset. We use a total query budget of 1000 samples from the training dataset, which comprises 30000 samples in total. The following table shows the score for different methods from the reward model (trained with the full training dataset) on the evaluation dataset.
| | OPRIDE | OPRIDE w/o tuning | baseline (random selection) |
|---|---|---|---|
| RM Score | 1.24 | 0.72 | 0.65 |
Table 1. Experimental results on the LLM setting.
Additionally, we assessed the performance of our method using as the judge. The results clearly show that our method outperforms the baseline by a large margin.
| Win Rate (row over col) | OPRIDE | baseline |
|---|---|---|
| OPRIDE | - | 54.7±0.3 |
| baseline | 45.3±0.3 | - |
Table 2. Experimental results on the LLM setting.
W6, Q5 and Q6: Choice of baselines and ablations:
- Results of IDRL on Antmaze, Figure 2, and Table 6 in the Appendix: As suggested, we have added the experimental results of IDRL on Antmaze, in Figure 2, and in Table 6 in the Appendix in the revised version. The experimental results show that our method achieves better performance than the baselines.
- What about no variance scheduling? As suggested, we conduct ablation studies for OPRIDE. Specifically, we remove the variance scheduling (named OPRIDE w/o VDS) and the in-dataset exploration module (named OPRIDE w/o IDE), respectively. The experimental results in Table 3 show the effectiveness of each part of our algorithm.
| Task | OPRIDE | OPRIDE w/o VDS | OPRIDE w/o IDE |
|---|---|---|---|
| bin-picking | 93.3±3.2 | 70.9±9.4 | 71.9±9.0 |
| button-press-wall | 77.7±0.1 | 68.1±9.7 | 77.2±0.8 |
| door-close | 94.8±1.1 | 87.6±0.7 | 72.3±0.1 |
| faucet-close | 73.1±0.8 | 57.4±12.1 | 59.4±8.5 |
| peg-insert-side | 79.0±0.2 | 74.9±4.2 | 13.8±4.4 |
| reach | 88.0±0.5 | 86.5±1.9 | 83.3±0.1 |
| sweep | 78.5±1.0 | 69.9±1.7 | 28.7±1.8 |
Table 3. Ablation study for OPRIDE.
W7: Discussion of practical downsides:
- Computational cost of training ensembles of reward models and policies: Thanks for your suggestion. We report the computational cost on a GeForce RTX 3090 GPU in Table 4. The results indicate that as the number of ensemble members increases, the computational cost does not increase significantly. This is because we adopt a multi-head mechanism instead of training separate networks, thereby saving training time (a minimal sketch of this multi-head design is given after the table).
| Task | M = 2 | M = 5 | M = 10 |
|---|---|---|---|
| bin-picking | 1.1h | 1.2h | 1.3h |
| button-press-wall | 1.1h | 1.1h | 1.2h |
| door-close | 1.1h | 1.2h | 1.3h |
| faucet-close | 0.9h | 1.0h | 1.2h |
| peg-insert-side | 1.0h | 1.1h | 1.3h |
| reach | 1.1h | 1.2h | 1.3h |
| sweep | 1.2h | 1.3h | 1.4h |
Table 4. Computational cost of OPRIDE for different ensemble sizes M (h denotes hours).
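As an illustration of the multi-head mechanism mentioned above (a shared trunk with M small output heads instead of M separate networks), a hedged PyTorch-style sketch follows; the class name, layer sizes, and hidden dimensions are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiHeadCritic(nn.Module):
    """Ensemble of M Q-heads sharing a single feature trunk.

    Increasing M only adds small linear heads, which is why training time
    grows slowly with the ensemble size in Table 4.
    """

    def __init__(self, obs_dim, act_dim, num_heads=5, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, 1) for _ in range(num_heads)]
        )

    def forward(self, obs, act):
        features = self.trunk(torch.cat([obs, act], dim=-1))
        # Shape: (num_heads, batch, 1); the variance across heads drives both
        # query selection and discount scheduling.
        return torch.stack([head(features) for head in self.heads], dim=0)
```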
- Dealing with less ideal data: In our experiments, we adopt non-ideal data rather than an idealised setting. Specifically, each dataset consists of 1000 trajectories: 50 of them are collected by the corresponding scripted policy with Gaussian noise added to increase diversity, and the remaining 950 trajectories are collected with an epsilon-greedy variant of the former noisy policy that selects random actions with a fixed probability.
Q1 and Q7: How does the proposed method compare to typical approaches there, and could dueling bandit algorithms be adapted to serve as baselines?
A for Q1 and Q7: We use Q-values rather than returns in the selection phase. This is in stark contrast with bandit algorithms, which do not consider credit assignment across timesteps. Per the reviewer's suggestion, we added additional baselines that use dueling bandit algorithms as the query selection criterion. As shown in Table 5, our method significantly outperforms the dueling bandit variant.
| Task | OPRIDE | OPRIDE with Dueling Bandit |
|---|---|---|
| bin-picking | 93.3±3.2 | 72.1±6.3 |
| button-press-wall | 77.7±0.1 | 71.8±0.6 |
| door-close | 94.8±1.1 | 76.8±2.2 |
| faucet-close | 73.1±0.8 | 60.7±2.5 |
| peg-insert-side | 79.0±0.2 | 64.7±4.7 |
| reach | 88.0±0.5 | 82.3±0.8 |
| sweep | 78.5±1.0 | 67.9±1.4 |
Table 5. Comparison with Dueling Bandit.
Q3: Will code for the experiments be released to enable others to reproduce the work?
A for Q3: Yes, the code is available in the supplementary material and our code will be released.
Q4: How many datapoints are the boxplots computed over in Figure 2?
A for Q4: We use five seeds for each datapoint in boxplots in Figure 2.
We sincerely thank the reviewer again for the timely and valuable comments. We hope that our response and additional experimental results have addressed most of your concerns.
[1] Lindner, David, et al. "Information directed reward learning for reinforcement learning." Advances in Neural Information Processing Systems 34 (2021): 3850-3862.
[2] Pacchiano, Aldo, Aadirupa Saha, and Jonathan Lee. "Dueling rl: reinforcement learning with trajectory preferences." arXiv preprint arXiv:2111.04850 (2021).
[3] Chen, Xiaoyu, et al. "Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation." International Conference on Machine Learning. PMLR, 2022.
[4] Jiang, Nan, et al. The dependence of effective planning horizon on model accuracy.
[5] Hu, Hao, et al. On the role of discount factor in offline reinforcement learning.
Dear Reviewer,
We have conducted additional experiments on LLM, computational costs and ablations. We are wondering if our response and revision have cleared your concerns. We would appreciate it if you could kindly let us know whether you have any other questions. We are looking forward to comments that can further improve our current manuscript. Thanks!
Best regards,
The Authors
Thank you for the extensive response and for providing the additional results. This clarifies many points and addresses most of my concerns.
However, when reading through the other reviews and your responses, I found myself wondering about the limitations of the theoretical analysis and was not yet convinced by your response on this point. In particular, in the discussion with reviewer FYzH, you say:
When the size of the dataset approaches infinity, any online trajectory can be found in the offline dataset, so that selecting queries from the dataset is equivalent to rolling out online queries.
I did not originally realize that this kind of assumption was in place for the theoretical analysis, and I find it rather strong, particularly for a paper that is focused on the offline setting. Could you expand on the justification for this assumption and comment on the extent to which your results apply in a setting where the offline dataset is of finite (and practical) size?
Dear Reviewer,
Thank you for your thoughtful feedback and for noting that our previous response addressed most of your concerns. We deeply appreciate your engagement with our work.
Regarding the online query assumption in the theoretical analysis, we would like to clarify that it is only used to simplify the analysis and can indeed be relaxed. When using purely offline queries, we need to ensure that the queries are well-represented in the dataset. This introduces an additional error term on the order of . This error is negligible when (as it becomes zero, as in the online case) or when (as the error becomes much smaller than the reward error term). For a more rigorous description, please refer to This Link.
We sincerely hope this addresses your remaining concerns. We look forward to further discussion!
Best,
The Authors
The paper presents OPRIDE, a novel algorithm aimed at improving query efficiency in offline preference-based reinforcement learning (PbRL). The study tackles the challenge of reducing the costs associated with obtaining human feedback by optimizing the selection of queries and addressing issues related to overoptimizing learned reward functions. OPRIDE employs an exploration strategy that leverages in-dataset trajectory analysis and a variance-based discount scheduling to ensure balanced learning outcomes. Experimental results demonstrate its effectiveness over existing approaches across various tasks in locomotion and manipulation, with a theoretical guarantee of efficiency.
Strengths
- The OPRIDE algorithm integrates a principled exploration mechanism that maximizes query informativeness, effectively addressing inefficiencies in query selection for offline PbRL.
- The study includes ablation studies that delineate the contributions of each component, alongside comparative results against multiple state-of-the-art baselines, highlighting OPRIDE's query efficiency and overall robustness.
- Though I have not checked all details of the theoretical derivations, the theorem and the proof seem rigorous.
Weaknesses
- This paper seems like a simple and inconsistent combination of two loosely related methods. The main idea of this paper is to increase the query efficiency of PbRL. While in-dataset query exploration is related to this main idea, the connection between reward overoptimization and query efficiency is neither straightforward nor clearly discussed in this paper.
- The novelty of this paper is limited. The two proposed methods, namely query exploration and uncertainty-based discount scheduling, have been extensively explored in previous research. The former is similar to the unsupervised exploration approach in PEBBLE [1]. The latter is similar to uncertainty-based methods in traditional RL, such as REDQ [2] in online RL and MOPO [3] in offline RL.
- In the theoretical analysis, access to online queries is assumed, so the results are poorly related to the practical performance of the proposed algorithm.
References
[1] PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training.
[2] Randomized Ensembled Double Q-Learning: Learning Fast Without a Model.
[3] MOPO: Model-based Offline Policy Optimization.
Questions
- Why does overoptimization of learned reward functions contribute to the low query efficiency of PbRL? How does the discount scheduling mechanism alleviate this?
- What is the advantage of the current type of query selection compared with PEBBLE?
- Can you provide a more empirical elaboration on the assumption of finite Eluder dimension in the theory part? Why do we need this?
- In the theoretical analysis, it is found that querying with an offline dataset can be much more sample-efficient than pure online queries. But how do such findings relate to the algorithm pipeline, including in-dataset exploration and the discount scheduling?
Details of Ethics Concerns
N/A
Dear Reviewer,
Thanks for your valuable comments. We hope the following statement can address your concern.
W1 and W3: The proposed methods are loosely related, and the theoretical results are poorly related to the practical performance.
A for W1: We respectfully disagree with this assessment. Our theoretical analysis suggests that there are two critical requirements for an efficient offline PbRL algorithm: a pessimistic reward function and an efficient explorative query selection criterion. We implement the former with the variance-based discount scheduling mechanism and the latter with the disagreement-based criterion. These two modules are closely related to our theoretical analysis, and both are shown to be crucial for efficient PbRL algorithms.
W2 and Q2: The novelty of this paper and comparison with previous works.
A for W2 and Q2: We respectfully disagree with this assessment.
- Our query selection is significantly different from PEBBLE, which focuses on maximizing information with respect to the reward function, while we focus on information with respect to the value function. Intuitively, disagreement over the reward function in low-return regions does not affect the judgement of optimal policies, so it can be less efficient compared with using disagreement over value functions. As shown in Tables 1 and 2, our method significantly outperforms reward-based query selection (OPRL).
- While there are many uncertainty-based methods in the previous literature, such as REDQ and MOPO, we are the first to apply this idea to the discount factor rather than to the reward or value function. This is novel and comes with theoretical guarantees. As shown in our ablation experiment in Table 3, while simple, this change leads to significant performance improvement.
| Task | OPRIDE | OPRIDE (Uniform) |
|---|---|---|
| bin-picking | 93.3±3.2 | 70.9±9.4 |
| button-press-wall | 77.7±0.1 | 68.1±9.7 |
| door-close | 94.8±1.1 | 87.6±0.7 |
| faucet-close | 73.1±0.8 | 57.4±12.1 |
| peg-insert-side | 79.0±0.2 | 74.9±4.2 |
| reach | 88.0±0.5 | 86.5±1.9 |
| sweep | 78.5±1.0 | 69.9±1.7 |
Table 1. Experiments with OPRIDE using uniformly selected queries.
Q1: Why overoptimization of learned reward functions contributes to the low query efficiency of pbrl? How does the discount scheduling mechanism alleviate this?
A for Q1: Intuitively, a learned reward function is prone to overoptimization [1, 2], leading to overestimation of a policy's return; therefore, a pessimistic estimate is required to avoid such exploitation of the imperfect reward function. For this reason, we propose to reduce the discount factor based on the variance of the value function estimates as a pessimistic mechanism, which is more effective than previous ensemble-based methods like REDQ. Without any pessimistic estimation, we would need many more queries before the estimated reward function becomes good enough to resist such overoptimization.
Q3: More empirical elaboration on the assumption of finite Eluder dimension in the theory part and Why do we need this?
A for Q3: Intuitively, the Eluder dimension is the "RL version" of the VC dimension. Without a finite Eluder dimension, the reward function class can be too large for us to draw any meaningful conclusion from a small number of samples. We would also like to note that finite Eluder dimension is in fact a general assumption that covers linear settings, among others, and appears in many RL theory works.
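For readers unfamiliar with the notion, the standard definition from Russo and Van Roy (2013) reads roughly as follows; this uses generic notation and is not the paper's exact statement.

```latex
% x is \epsilon-dependent on x_1, \dots, x_n with respect to a class \mathcal{F}
% if any two functions that are close on the history are also close at x:
\sqrt{\sum_{i=1}^{n} \big(f(x_i) - f'(x_i)\big)^2} \le \epsilon
\;\Longrightarrow\;
|f(x) - f'(x)| \le \epsilon
\qquad \forall f, f' \in \mathcal{F}.
% Otherwise x is \epsilon-independent of x_1, \dots, x_n. The Eluder dimension
% \dim_E(\mathcal{F}, \epsilon) is the length of the longest sequence of inputs
% in which every element is \epsilon'-independent of its predecessors for some
% \epsilon' \ge \epsilon.
```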
Q4: It is found that querying with an offline dataset can be much more sample-efficient than pure online queries. How do such findings related to the algorithm pipeline, including in-dataset exploration and the discount scheduling?
A for Q4: This is a comment regarding the offline setting versus the online setting rather than our specific algorithm. Our algorithm pipeline belongs to the former: it selects queries from an offline dataset and therefore requires far fewer queries than pure online methods like [3].
Thanks again for the valuable comments. We hope our additional experimental results and explanations have addressed your concerns. We sincerely hope that the reviewer can re-evaluate our paper after seeing our response. Further comments on improving the presentation are also very welcome.
[1] Gao, L., Schulman, J., and Hilton, J. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp. 10835–10866. PMLR, 2023
[2] Zhu, B., Jordan, M. I., and Jiao, J. Iterative data smoothing: Mitigating reward overfitting and overoptimization in rlhf. arXiv preprint arXiv:2401.16335, 2024
[3] Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." Advances in neural information processing systems 30 (2017).
I thank the authors for the rebuttal. My concerns about the relationship between the two methods and the novelty of the idea remain after reading the rebuttal, so my score will be kept unchanged.
Dear Reviewer,
We have conducted additional experiments and explanations for our methods. We are wondering if our response and revision have cleared your concerns. We would appreciate it if you could kindly let us know whether you have any other questions. We are looking forward to comments that can further improve our current manuscript. Thanks!
Best regards,
The Authors
The paper studied an approach that performs active query selection to improve query efficiency and uses a discount scheduling mechanism to mitigate reward overoptimization. The paper provides theoretical results to support its claims and also empirically verified the approach on domains such as locomotion and manipulation. The presented empirical results are strong. The weaknesses of the paper lie in the gap between its theoretical results and the empirical algorithm, and in the novelty of the ideas compared to prior work.
Additional Comments from Reviewer Discussion
The reviewers acknowledged the authors' effort during the rebuttal phase. However, the reviewers in general are not convinced that the paper is ready for publication at this stage. The reviewers are generally concerned about the gap between the theoretical results and the empirical algorithm. One reviewer mentioned that while the new theoretical result claimed by the authors (in their last response) would address some of the gap between the theoretical analysis and the actual algorithm, the short outline proof makes it hard for the reviewer to verify. One reviewer was still concerned about the novelty of the proposed approach compared to prior work that uses uncertainty-based models in RL. Overall, the reviewers and AC think that the paper would benefit from a rewrite to improve clarity and incorporate the latest results.
Reject