PaperHub
Average Rating: 6.0 / 10 (Poster · 4 reviewers · min 5, max 8, std 1.2)
Individual Ratings: 6, 8, 5, 5
Confidence: 3.5 · Correctness: 2.3 · Contribution: 2.5 · Presentation: 2.5
ICLR 2025

Fewer May Be Better: Enhancing Offline Reinforcement Learning with Reduced Dataset

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-26
TL;DR

Construct a reduced dataset that improves algorithm performance while accelerating training.

Abstract

Keywords
Offline Reinforcement Learning; Data Selection; Grad Match

Reviews and Discussion

Review (Rating: 6)

The authors introduce an approach for reducing the size of a dataset for offline RL by defining this reduction as a submodular set cover problem and using orthogonal matching pursuit. The resulting algorithm is evaluated on a modified version of D4RL locomotion tasks and the original antmaze tasks.

Strengths

  • This is an interesting and novel approach for data selection in RL. The high-level approach/formulation of the problem may be useful as a foundation for extensions.
  • Strong results on a modified version of D4RL, and the unmodified antmaze.

Weaknesses

There is a discrepancy between the proposed objectives and the resulting objectives that makes me question where the effectiveness of the proposed approach comes from.

The problem is initially defined as finding a subset of the data which results in a higher performing policy than the policy determined by training on the original dataset (Eqn 3). However, this is immediately discarded for another optimization problem, which instead tries to limit change in the value function (Eqn 5). While discovering a smaller dataset which achieves the same performance as the original dataset is an interesting problem, the authors claim in several places (and demonstrate) that their reduced dataset actually improves the performance. So where does the performance gain come from?

One possible cause for the performance increase is how the evaluation is done (add noisy/low performing trajectories to the D4RL dataset) and the filtering of low performing trajectories (Eqn 14). I would be very curious if this filtering alone is sufficient to also recover the performance of the algorithm. This concern, along with some missing key experimental details, makes me cautious about the experimental claims made in the paper.

Missing References which also filter the dataset using returns:

  • [1] Chen, Xinyue, et al. "Bail: Best-action imitation learning for batch deep reinforcement learning." Advances in Neural Information Processing Systems 33 (2020): 18353-18363.
  • [2] Yue, Yang, et al. "Boosting offline reinforcement learning via data rebalancing." arXiv preprint arXiv:2210.09241 (2022).

Questions

Additional experiments:

  • Does simply filtering the dataset by high returns recover the same performance?
  • What is the performance of ReDOR on the original version of D4RL? One might expect that reducing mixed quality datasets like medium-expert, or medium, could also result in a high performance.

Missing experimental details:

  • How is the hard dataset generated? How many datapoints are added to the dataset?
  • How many datapoints are removed by ReDOR? What is the size of the reduced datasets?

General:

  • Is there a way to tune the resulting dataset size?
  • Is Fig 3, episode return = 99.5 for behaviors [2-7] correct or a bug?
Comment

Dear Reviewer,

Thank you for finding our paper interesting and novel. We hope the following statements clear up your concerns.

W1 and Q1: Where does the performance gain come from, and does simply filtering the dataset by high returns recover the same performance?

A for W1 and Q1: The reason for using Equation 5 instead of Equation 3 is that, in the offline RL setting, we cannot directly solve the problem defined by Equation 3. We therefore adopt Equation 5 as an approximate alternative optimization problem.

The performance gain comes from two aspects: (1) the reduced dataset eliminates a large amount of redundant data, making the learning process of the algorithm more efficient; (2) we balance data quantity with performance by focusing on data points that are aligned with the learned policy.

As suggested, we conduct an ablation study on the D4RL (Hard) tasks by simply filtering the dataset by high returns. The experimental results in Table 1 show that simply filtering data with high returns is not sufficient to achieve good performance. This is because the task cannot be well defined given only expert demonstrations; diverse trajectories help specify the boundary of the task.

Finally, we have added the missing references as suggested in the revised version.

| D4RL (Hard) | Simply Filtering | ReDOR |
|---|---|---|
| walker2d-medium-v0 | 65.4±3.6 | 80.5±2.9 |
| halfcheetah-medium-v0 | 30.5±2.4 | 41.0±0.2 |
| hopper-medium-v0 | 91.2±3.2 | 94.3±4.6 |
| walker2d-medium-replay-v0 | 13.4±1.6 | 21.1±1.8 |
| halfcheetah-medium-replay-v0 | 40.2±0.3 | 41.1±0.4 |
| hopper-medium-replay-v0 | 31.0±3.1 | 35.3±3.2 |
| walker2d-expert-v0 | 86.4±2.6 | 104.6±2.5 |
| halfcheetah-expert-v0 | 87.2±5.3 | 88.5±2.4 |
| hopper-expert-v0 | 110.3±0.1 | 110.0±0.5 |

Table 1. Comparison with the simply filtering baseline in the D4RL (Hard) tasks.
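For readers who want a concrete picture of the "simply filtering" baseline discussed above, the sketch below shows one way such return-based filtering could be implemented; the trajectory container, field names, and function name are illustrative assumptions rather than the code actually used in the paper.

```python
import numpy as np

def filter_top_return_trajectories(trajectories, m=50):
    """Keep only the top-m% of trajectories ranked by episode return.

    `trajectories` is assumed to be a list of dicts, each holding a per-step
    "rewards" array; this container format is an illustrative assumption.
    """
    returns = np.array([traj["rewards"].sum() for traj in trajectories])
    cutoff = np.percentile(returns, 100 - m)  # return threshold for the top m%
    return [traj for traj, ret in zip(trajectories, returns) if ret >= cutoff]
```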

Q2: What is the performance of ReDOR on the original version of D4RL?

A for Q2: As suggested, we conduct additional experiments on the D4RL (Original) tasks. The experimental results in Table 2 show that reducing the original dataset can still bring performance gains to the algorithm.

| D4RL (Original) | Complete Dataset | ReDOR |
|---|---|---|
| walker2d-medium-v0 | 79.7±1.8 | 89.3±2.3 |
| halfcheetah-medium-v0 | 42.8±0.3 | 45.2±0.2 |
| hopper-medium-v0 | 99.5±1.0 | 101.4±2.1 |
| walker2d-medium-replay-v0 | 25.2±5.1 | 40.1±3.8 |
| halfcheetah-medium-replay-v0 | 43.3±0.5 | 60.1±0.4 |
| hopper-medium-replay-v0 | 31.4±3.0 | 53.3±2.2 |
| walker2d-expert-v0 | 105.7±2.7 | 108.6±2.3 |
| halfcheetah-expert-v0 | 105.7±1.9 | 110.5±2.5 |
| hopper-expert-v0 | 112.2±0.2 | 115.0±0.5 |

Table 2. Performance on the D4RL (Original) tasks.

Q3: How is the hard dataset generated? How many datapoints are added to the dataset?

A for Q3: We added noise data generated by various behavioral policies to the original dataset to simulate the noise of real-world data collection. In each dataset, the size of the added noise data is 20% of the original dataset.
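As a rough illustration of this construction (not the exact script used for the paper), noisy transitions amounting to 20% of the original dataset could be appended as follows; the array-of-transitions format and the function name are assumptions.

```python
import numpy as np

def build_hard_dataset(original, noisy, ratio=0.2, seed=0):
    """Append noisy transitions amounting to `ratio` of the original dataset size."""
    rng = np.random.default_rng(seed)
    n_add = min(int(ratio * len(original)), len(noisy))     # 20% of the original size
    idx = rng.choice(len(noisy), size=n_add, replace=False)  # sample noisy transitions
    return np.concatenate([original, noisy[idx]], axis=0)
```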

Q4: How many datapoints are removed by ReDOR? What is the size of the reduced datasets?

A for Q4: In our experiments, each dataset is reduced by approximately 70% to 90%. Specifically, Walker2d and Hopper are reduced by about 70%, while Halfcheetah is reduced by about 90%.

Q5: Is there a way to tune the resulting dataset size?

A for Q5: Yes, we can tune the size of the data subset by changing the approximation error $\epsilon$. For example, as $\epsilon$ increases, the precision of gradient matching decreases, resulting in a smaller data subset. Conversely, as $\epsilon$ decreases, the data subset becomes larger.
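To make the role of $\epsilon$ concrete, here is a minimal gradient-matching OMP sketch in which the stopping tolerance directly controls the subset size; the per-sample gradient matrix, the least-squares weight refit, and the function name are illustrative assumptions, not the released implementation.

```python
import numpy as np

def omp_select(G, g_full, eps=0.01, max_size=None):
    """Greedily pick samples whose gradients best match the full-dataset gradient.

    G: (n, d) per-sample gradients; g_full: (d,) full-dataset gradient.
    Selection stops once the residual norm drops below eps, so a larger eps
    yields a smaller subset and a smaller eps yields a larger one.
    """
    n, _ = G.shape
    max_size = max_size or n
    selected, w = [], np.zeros(0)
    residual = g_full.copy()
    while len(selected) < max_size and np.linalg.norm(residual) > eps:
        scores = np.abs(G @ residual)          # correlation with the current residual
        scores[selected] = -np.inf             # do not pick the same sample twice
        selected.append(int(np.argmax(scores)))
        A = G[selected].T                      # (d, |S|) matrix of chosen gradients
        w, *_ = np.linalg.lstsq(A, g_full, rcond=None)
        residual = g_full - A @ w              # orthogonalize against the selected span
    return np.array(selected), w
```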

Q6: Is Fig 3, episode return = 99.5 for behaviors [2-7] correct or a bug?

A for Q6: Thanks for your suggestions. It is a bug and we have corrected it in the revised version.

Thanks again for the valuable comments. We hope our response has cleared your concerns. We look forward to further discussion.

Comment

Thank you for the response, and the additional experiments.

  1. The missing details you have provided here are helpful for understanding and reproducing your work. Can you make sure these appear in a revised draft of the paper?
  2. I'm not sure I agree that "simply filtering data with high returns is not sufficient to achieve good performance" based on the results you have provided in Table 1, since the performance looks very competitive.
Comment

Dear Reviewer,

Thanks for the quick reply. We will address your follow-up questions below.

1: Can the missing details appear in a revised draft of the paper?

A1: Yes, we assure you that the added details and experiments will be incorporated into the revised version to enhance the clarity and persuasiveness of the paper. Once again, we express our gratitude to the reviewer; your suggestions have greatly improved this manuscript.

2: Experiments of simply filtering data.

A2: This is because the MuJoCo tasks are relatively simple, so the performance of this baseline looks competitive there. For this reason, we conduct additional experiments on more complex tasks, such as Adroit. Specifically, the datasets for the Adroit tasks, including the human and cloned data, are more realistic and were collected by humans. The experimental results in Table 1 show that on these more challenging datasets, ReDOR achieves a significant performance improvement compared to the Simply Filtering baseline.

| Adroit | Simply Filtering | ReDOR |
|---|---|---|
| pen-human-v0 | 73.2±4.1 | 107.5±3.4 |
| hammer-human-v0 | 2.6±0.2 | 15.3±0.9 |
| door-human-v0 | 5.8±0.1 | 11.9±2.6 |
| relocate-human-v0 | 0.1±0.0 | 4.5±1.1 |
| pen-cloned-v0 | 40.6±2.9 | 103.4±4.4 |
| hammer-cloned-v0 | 3.7±0.7 | 24.2±7.0 |
| door-cloned-v0 | 2.4±0.3 | 12.4±2.7 |
| relocate-cloned-v0 | 0.3±0.1 | 2.4±0.3 |

Table 1. Comparison with the simply filtering baseline in the Adroit tasks.

Best,

The Authors

Comment

Thanks for the additional experiments. I'm not convinced by the significance of the results, but it is clear that ReDOR does more than just filtering. I've updated my score (5 -> 6) to reflect this.

Comment

We would like to thank the reviewer for raising the score to 6! We also appreciate the valuable comments, which have helped us significantly strengthen the paper.

Review (Rating: 8)

The paper explores the interesting concept of finding a subset of the offline dataset to improve the performance of offline RL algorithms using orthogonal matching pursuit. The authors provide empirical and theoretical evidence of performance improvement on benchmark datasets.

Strengths

  1. The paper is well written and the idea is easy to follow.
  2. The idea of subset selection is novel and interesting.
  3. The paper provides both strong theoretical study and empirical analysis of the proposed method.

Weaknesses

  1. The authors characterize the field of offline RL only in terms of OOD action penalization and constraints on the behavior policy. There should also be a short discussion on model-based methods like MOReL [1] and MOPO [2], as some of these approaches have been shown to outperform model-free methods.

  2. Some parts of the paper are difficult to understand without prior knowledge of orthogonal matching pursuit. Specifically, how is $F_\lambda(S) = L_{\max} - \min_w \mathrm{Err}_\lambda(w, S, L, \theta)$ used in the OMP?

  3. If I understand correctly, this method may not lead to the claimed reduction in complexity, as training $Q_\theta$ and $\pi_\phi$ still requires the full dataset.

Minor

The table references do not match the table numbers. On line 420, I believe the authors are referring to Table 1 instead of 6.2.

Suggestion: If the authors could include a notation table in the Appendix, it would help readability and understanding of the proofs.

References:
[1] Kidambi, Rahul, et al. "MOReL: Model-based offline reinforcement learning." Advances in Neural Information Processing Systems 33 (2020): 21810-21823.
[2] Yu, Tianhe, et al. "MOPO: Model-based offline policy optimization." Advances in Neural Information Processing Systems 33 (2020): 14129-14142.

Questions

Q1. How is the weight $w_i$ or $\lambda$ decided during training, and how are the parameters $L_{\max}$, $m$, $\epsilon$ chosen in practice?

Q2. Are the networks $Q_\theta$ and $\pi_\phi$ first trained on the full dataset before starting the subset selection?

Q3. What is the empirical reduction percentage achieved in each dataset?

Q4. In Figure 1 for the walker2d-expert-v0 environment, the reward first increases and then drops. It is also counterintuitive that the subset selected in ReDOR would perform better than a dataset containing only expert trajectories. Could the authors provide an explanation for this behavior?

Q5. Could the authors elaborate more on the Prioritize baseline? What do samples with the highest TD loss mean?

Q6. How does ReDOR perform on random datasets such as halfcheetah-random-v2?

Q7. I could not understand Fig 3. Why are there more reduced-dataset points for category 6 when it is a subset of the complete dataset?

Comment

Dear Reviewer,

Thank you for finding our paper novel and interesting, and for recognizing its theoretical study and empirical analysis. We hope the following statements clear up your concerns.

W1: Short discussion on model-based methods like MOReL and MOPO.

A for W1: Thanks for your suggestion; we have added a short discussion of these model-based methods in the revised version.

W2: How is $F_\lambda(S) = L_{\max} - \min_w \mathrm{Err}_\lambda(w, S, L, \theta)$ used in the OMP?

A for W2: Since $L_{\max}$ is a constant, maximizing $F_\lambda(S)$ is equivalent to minimizing $\mathrm{Err}_\lambda(w, S, L, \theta)$. Therefore, we adopt OMP to directly minimize $\mathrm{Err}_\lambda(w, S, L, \theta)$.

W3 and Q2: Does the method lead to the claimed reduction in complexity, and are the networks $Q_\theta$ and $\pi_\phi$ first trained on the full dataset before starting the subset selection?

A for W3 and Q2: Yes, the selected data subset reduces the computational complexity. Specifically, we first train $Q_\theta$ and $\pi_\phi$ on the full dataset before starting the subset selection. Then, we load the pre-trained parameters $\theta, \phi$ to select the data subset and re-train offline RL methods on the reduced dataset from scratch. The experimental results in the paper show that the reduced dataset speeds up the training process.
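A schematic of this three-stage workflow (pre-train, select, re-train from scratch) is sketched below; the interfaces `make_agent`, `select_subset`, and `sample_batch` are hypothetical placeholders, not the authors' API.

```python
def redor_style_pipeline(full_dataset, make_agent, select_subset,
                         pretrain_steps=100_000, train_steps=1_000_000):
    """Pre-train on the full dataset, select a reduced dataset, then re-train from scratch."""
    # Stage 1: standard offline RL training on the complete dataset.
    pretrained = make_agent()
    for _ in range(pretrain_steps):
        pretrained.update(full_dataset.sample_batch())

    # Stage 2: use the pre-trained parameters to pick the data subset.
    reduced_dataset = select_subset(full_dataset, pretrained)

    # Stage 3: train a fresh agent on the reduced dataset only.
    agent = make_agent()
    for _ in range(train_steps):
        agent.update(reduced_dataset.sample_batch())
    return agent, reduced_dataset
```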

Minor:

  • The table references do not match the table numbers: Thanks for your suggestions, we have corrected it in the revised version.

  • Notations table in Appendix: Thanks for your suggestions, we have added the notation table in the revised version.

Q1: How is the weight $w_i$ or $\lambda$ decided during training, and how are the parameters $L_{\max}$, $m$, $\epsilon$ chosen in practice?

A for Q1: In practice, $w_i$ and the indices $i$ are generated simultaneously by OMP. As for the other parameters, we use uniform values across different tasks: $\lambda=0.1$, $m=50$, $\epsilon=0.01$. Since we use the OMP method to directly minimize $\mathrm{Err}_\lambda(w, S, L, \theta)$, we do not need to set $L_{\max}$.

Q3: What is the empirical reduction percentage achieved in each dataset?

A for Q3: In our experiments, each dataset is reduced by approximately 70% to 90%. Specifically, Walker2d is reduced by about 70%, while Halfcheetah is reduced by about 90%.

Q4: Why does ReDOR perform better than a dataset containing only expert trajectories in Figure 1?

A for Q4: In the tasks depicted in Figure 1, we evaluate baselines on D4RL (Hard), which includes some suboptimal noisy data. We find that suboptimal data may lead to significant performance degradation due to distribution shifts. For this reason, ReDOR performs better than other baselines in the walker2d-expert-v0 task.

Q5: What do samples with highest TD Loss mean?

A for Q5: The Prioritize baseline is designed based on Prioritized Experience Replay for online RL [1]. Specifically, when the value function is updated with gradients, prioritizing data with larger loss values provides more information for policy learning. Inspired by this, in each round of data selection, we choose the data with the highest TD loss as the data subset.
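For clarity, a minimal PyTorch-style sketch of this baseline is given below: rank transitions by absolute TD error and keep the top-k. The batch layout and network interfaces are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def select_highest_td_error(q_net, target_q_net, policy, batch, k, gamma=0.99):
    """Return indices of the k transitions with the largest absolute TD error."""
    q = q_net(batch["obs"], batch["act"]).squeeze(-1)
    next_q = target_q_net(batch["next_obs"], policy(batch["next_obs"])).squeeze(-1)
    td_target = batch["rew"] + gamma * (1.0 - batch["done"]) * next_q
    td_error = (q - td_target).abs()
    return torch.topk(td_error, k).indices
```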

Q6: How does ReDOR perform on random datasets such as halfcheetah-random-v2?

A for Q6: As suggested, we conduct experiments on the random datasets, as shown in Table 1. The experimental results show that the performance improvement on the random datasets is not significant, because there is a relatively small amount of high-quality data in a random dataset.

| D4RL (Random) | Complete Dataset | ReDOR |
|---|---|---|
| walker2d-random-v0 | 1.4±1.6 | 2.4±1.0 |
| halfcheetah-random-v0 | 10.2±1.3 | 12.2±1.8 |
| hopper-random-v0 | 11.0±0.1 | 10.2±0.3 |

Table 1. Experimental results on the random dataset.

Q7: Why are there more reduced-dataset points for category 6 when it is a subset of the complete dataset?

A for Q7: In Figure 3, each component reflects a distinct skill of the agent. Category 6 represents stepping, which is relatively important compared with the other skills; hence more points are allocated to it.

We sincerely thank the reviewer again for the timely and valuable comments. We hope that our response and additional experimental results have cleared most of your concerns.

[1] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

Comment

Thank you for the response. They have addressed my concerns. I am happy to recommend accept for this paper.

Comment

We would like to thank the reviewer for raising the score to 8! We also appreciate the valuable comments, which have helped us significantly strengthen the paper.

Review (Rating: 5)

This paper introduces a method for dataset selection in offline reinforcement learning (RL) using the Orthogonal Matching Pursuit (OMP) algorithm and Monte Carlo Q losses. The proposed approach selects full trajectories whose loss gradients align well with the residuals.

Strengths

  1. The method demonstrates improved performance compared to the baselines.
  2. The paper includes a theoretical analysis that provides a solid grounding for the approach.

Weaknesses

  1. Some key elements of the proposed algorithm are either missing or unclear, and there are some discrepancies between the paper and the accompanying codebase. For instance, the method used to generate "hard" datasets is not fully discussed in the paper, and the percentile $m$ mentioned in the paper differs from that in the codebase. More details are provided in the questions below.

  2. Certain parts of the proposed algorithm may contain logical errors or inconsistencies. For example, in Line 4 of Algorithm 2, $r$ is a scalar, yet an inner product operation is applied to it. More details are provided in the questions below.

  3. The baselines chosen for comparison seem somewhat outdated, which could affect the perceived significance of the performance improvements demonstrated by the proposed method.

Questions

  1. Could you please clarify how the suboptimal datasets for MuJoCo, namely "hard", were generated? The paper mentioned that they were generated by adding low-quality data, but the quality or source of such data and the mix ratio should also be introduced.

  2. Regarding $Q_\theta$ in Algorithm 1, could you explain how $Q_{\theta_t}$ was formulated? There is no update term in either the pseudocode or the codebase. Was $Q_\theta$ pretrained or trained simultaneously but omitted? It would be best if the pseudocode or thorough explanations were provided.

  3. In Equation 14, it is stated that trajectories in the top $m\%$ based on return are filtered, with $m$ set to 50, which would seem to exclude almost the entire random dataset. Could you provide the result of simply selecting trajectories with top $m\,(=50)\%$ returns for comparison?

  4. In the codebase, it seems that in addition to the evaluation of Monte Carlo Q targets, the selection of candidate trajectories via OMP is filtered based on trajectory returns. What is the exact search space of the selected trajectories? If it is the filtered one with trajectory returns, then how can we ensure the fairness of the comparison to baselines that do not utilize such a filter?

  5. In the paper, the percentile $m$ is specified as 50 (Top), but in the codebase, it varies (Bottom 50, 70, and 95). Could you clarify the reason for this difference?

  6. In Algorithm 2, $r$ is defined as a scalar, but in Line 4, an inner product is applied. Could you kindly explain this?

  7. In Line 3 of Algorithm 2, the inequality appears to be reversed. Is this correct?

  8. Is there a reason why TD3+BC was chosen as the backbone offline RL algorithm for the MuJoCo tasks? Would using IQL, as in the Antmaze tasks, provide a more consistent comparison?

  9. For the MuJoCo tasks, the authors used the "-v0" versions, which are now outdated and differ from the more recent "-v2" versions. Could you explain the reasoning behind using "-v0"?

  10. For the "Complete Dataset" scores in the Antmaze tasks, it seems that these values are taken from the IQL paper, which does not provide standard deviations. Could you clarify how these scores were derived?

  11. While the baselines used in the experiments appear somewhat dated, dataset selection has recently gained increased attention in offline RL. Hence, it seems that recent algorithms should be contained as baselines. For example, "Improving Generalization in Offline Reinforcement Learning via Adversarial Data Splitting (Wang et al., 2024)" provides a codebase, which could allow for a straightforward comparison. Or, is there any reason why such comparisons are inappropriate?

  12. Could you provide more details on what is meant by the "Complete Dataset" baseline? Specifically, is it the original mixture of the desired dataset and the suboptimal dataset, or is it just the original dataset?

Comment

Q9: Why use the -v0 versions?

A for Q9: The TD3+BC algorithm was evaluated on the -v0 versions in its original paper, so we also conduct our experiments on the -v0 versions; there is no other reason.

Q10: The standard deviations of IQL.

A for Q10: We ran the official code released by the IQL authors to supplement the standard deviations missing from that paper.

Q11: Recent baselines in dataset selection of offline RL.

A for Q11: The baselines considered in this paper are common approaches to data selection in supervised learning, which is why we chose these methods as our baselines. As suggested, we evaluate the recent baseline ADS [1] on the D4RL (Hard) tasks. The experimental results in Table 4 show that ReDOR achieves better performance than ADS in most tasks. We thank the reviewer for the suggestion, and we will place the complete experimental results and discussion in the revised version.

| D4RL (Hard) | ADS | ReDOR |
|---|---|---|
| walker2d-medium-v0 | 78.9±1.5 | 80.5±2.9 |
| halfcheetah-medium-v0 | 37.3±0.6 | 41.0±0.2 |
| hopper-medium-v0 | 91.4±2.8 | 94.3±4.6 |
| walker2d-medium-replay-v0 | 16.4±2.9 | 21.1±1.8 |
| halfcheetah-medium-replay-v0 | 40.4±0.6 | 41.1±0.4 |
| hopper-medium-replay-v0 | 31.9±2.7 | 35.3±3.2 |
| walker2d-expert-v0 | 98.9±2.8 | 104.6±2.5 |
| halfcheetah-expert-v0 | 88.2±1.3 | 88.5±2.4 |
| hopper-expert-v0 | 104.2±0.6 | 110.0±0.5 |

Table 4. Comparison with ADS in the D4RL (Hard) tasks.

Q12: More details on what is meant by the "Complete Dataset" baseline.

A for Q12: In the original D4RL tasks, the Complete Dataset baseline denotes the original dataset. For the D4RL (Hard) tasks, the complete dataset denotes the original mixture of the desired dataset and the suboptimal dataset.

Thanks again for the valuable comments. We hope our additional experimental results and explanations have cleared your concerns. We sincerely hope that the reviewer can re-evaluate our paper after seeing our response. More comments on further improving the presentation are also very much welcome.

[1] Wang, Da, et al. "Improving Generalization in Offline Reinforcement Learning via Adversarial Data Splitting." Forty-first International Conference on Machine Learning.

Comment

Dear Reviewer,

Thanks for your valuable comments. We hope the following statements can address your concerns.

W1 and Q1: How to generate hard datasets.

A for W1 and Q1: We added noise data generated by various behavioral policies to the original dataset to simulate the noise of real-world data collection. In each dataset, the size of the added noise data is 20% of the original dataset.

Q2: How was $Q_\theta$ formulated?

A for Q2: We first train $Q_\theta$ on the full dataset before starting the subset selection. Then, we load the pre-trained parameters $\theta$ to select the data subset and re-train offline RL methods on the reduced dataset from scratch.

Q3: The results of simply selecting trajectories with top returns for comparison.

A for Q3: As suggested, we conduct an ablation study on the D4RL (Hard) tasks by simply filtering the dataset by high returns ($m=50$). The experimental results in Table 1 show that simply filtering data with high returns is not sufficient to achieve good performance. This is because the task cannot be well defined given only expert demonstrations; diverse trajectories help specify the boundary of the task.

| D4RL (Hard) | Simply Filtering | ReDOR |
|---|---|---|
| walker2d-medium-v0 | 65.4±3.6 | 80.5±2.9 |
| halfcheetah-medium-v0 | 30.5±2.4 | 41.0±0.2 |
| hopper-medium-v0 | 91.2±3.2 | 94.3±4.6 |
| walker2d-medium-replay-v0 | 13.4±1.6 | 21.1±1.8 |
| halfcheetah-medium-replay-v0 | 40.2±0.3 | 41.1±0.4 |
| hopper-medium-replay-v0 | 31.0±3.1 | 35.3±3.2 |
| walker2d-expert-v0 | 86.4±2.6 | 104.6±2.5 |
| halfcheetah-expert-v0 | 87.2±5.3 | 88.5±2.4 |
| hopper-expert-v0 | 110.3±0.1 | 110.0±0.5 |

Table 1. Comparison with the simply filtering baseline in the D4RL (Hard) tasks.

Q4: What is the exact search space of the selected trajectories?

A for Q4: To balance data quantity with performance, we aim to focus on data points that are aligned with the learned policy, avoiding performance degradation caused by suboptimal trajectories. As suggested, we conduct additional experiments by adding the same filter module to the baselines. The experimental results in Table 2 show that ReDOR still performs better than baselines equipped with the filter module. This is because the reduced dataset can eliminate a large amount of redundant data, making the learning process of the algorithm more efficient.

| D4RL (Hard) | Random (Filter) | Prioritized (Filter) | ReDOR |
|---|---|---|---|
| walker2d-medium-v0 | 52.4±2.7 | 60.6±4.5 | 80.5±2.9 |
| halfcheetah-medium-v0 | 25.7±1.2 | 32.4±0.4 | 41.0±0.2 |
| hopper-medium-v0 | 90.7±3.9 | 92.2±2.7 | 94.3±4.6 |
| walker2d-medium-replay-v0 | 11.4±2.4 | 16.3±1.4 | 21.1±1.8 |
| halfcheetah-medium-replay-v0 | 27.2±0.6 | 28.9±0.8 | 41.1±0.4 |
| hopper-medium-replay-v0 | 22.6±2.5 | 34.4±1.7 | 35.3±3.2 |
| walker2d-expert-v0 | 79.3±3.2 | 89.7±2.9 | 104.6±2.5 |
| halfcheetah-expert-v0 | 79.6±1.9 | 68.1±3.3 | 88.5±2.4 |
| hopper-expert-v0 | 108.4±0.8 | 109.4±0.9 | 110.0±0.5 |

Table 2. Comparison with baselines in the D4RL (Hard) tasks.

Q5 and W1: The selection of $m$.

A for Q5 and W1: We thank the reviewer for raising the point. Due to poor communication among collaborators, the initial codebase was inadvertently uploaded.

Q6 and W2: The inner product operation in Algorithm 2.

A for Q6 and W2: $r$ is not a scalar. In practice, we use the gradients of the last layer of the neural networks, and $r$ denotes the residual error, which is a vector of shape $(\cdot, 257)$.
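To illustrate where a shape of $(\cdot, 257)$ can come from, the sketch below forms per-sample gradients with respect to a final linear layer with 256 input features plus a bias term; the split of the network into `trunk` and `head` and the squared-error form of the loss are assumptions made for illustration.

```python
import torch

def last_layer_gradients(q_net, batch):
    """Per-sample gradients of a squared-error critic loss w.r.t. the last linear layer.

    For a loss of the form (Q(s, a) - y)^2 / 2, the gradient w.r.t. the last layer's
    weights for sample i is (Q_i - y_i) * features_i, so no per-sample backward pass
    is needed.
    """
    feats = q_net.trunk(batch["obs"], batch["act"])   # penultimate features, shape (B, 256)
    preds = q_net.head(feats).squeeze(-1)             # final linear layer output, shape (B,)
    err = (preds - batch["target"]).detach()          # per-sample residual, shape (B,)
    grad_w = err.unsqueeze(-1) * feats.detach()       # weight gradient, shape (B, 256)
    grad_b = err.unsqueeze(-1)                        # bias gradient, shape (B, 1)
    return torch.cat([grad_w, grad_b], dim=-1)        # shape (B, 257)
```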

Q7: The inequality in Algorithm 2 appears to be reversed.

A for Q7: Thanks for your suggestions, we have corrected it in the revised version.

Q8: Is there a reason why TD3+BC was chosen as the backbone offline RL algorithm for the MuJoCo tasks? And would using IQL, as in the Antmaze tasks, provide a more consistent comparison?

A for Q8: The TD3+BC algorithm is one of the best-known algorithms in the offline RL community, which is why we chose it as the backbone. As suggested, we conduct additional experiments on the MuJoCo tasks with IQL as the backbone. The experimental results in Table 3 show that IQL can also be combined with ReDOR on the MuJoCo tasks.

| D4RL (Hard) | ReDOR (IQL) | ReDOR |
|---|---|---|
| walker2d-medium-v0 | 71.3±2.1 | 80.5±2.9 |
| halfcheetah-medium-v0 | 40.9±0.3 | 41.0±0.2 |
| hopper-medium-v0 | 96.4±2.3 | 94.3±4.6 |
| walker2d-medium-replay-v0 | 23.6±3.6 | 21.1±1.8 |
| halfcheetah-medium-replay-v0 | 43.1±0.9 | 41.1±0.4 |
| hopper-medium-replay-v0 | 39.1±4.4 | 35.3±3.2 |
| walker2d-expert-v0 | 106.6±1.4 | 104.6±2.5 |
| halfcheetah-expert-v0 | 108.2±2.7 | 88.5±2.4 |
| hopper-expert-v0 | 110.5±0.4 | 110.0±0.5 |

Table 3. Experimental results of ReDOR (IQL) in the D4RL (Hard) tasks.

Comment

Dear Reviewer,

We have conducted additional experiments on ablations and baselines. We are wondering if our response and revision have cleared your concerns. We would appreciate it if you could kindly let us know whether you have any other questions. We are looking forward to comments that can further improve our current manuscript. Thanks!

Best regards,

The Authors

Comment

Thank you for your clarification. Most concerns regarding implementation are addressed. I adjusted my score accordingly.

Comment

We would like to thank the reviewer for raising the score! We also appreciate the valuable comments, which have helped us significantly strengthen the paper.

Review (Rating: 5)

Motivated by the large size of offline datasets as well as suboptimal data quality in offline RL, this paper considers the problem of finding a coreset within the given dataset. The authors first formulate this problem as a task of approximating the actual gradients (from the complete dataset) during offline training, and a line of results is provided to support the low approximation errors. Then the method named Reduced Datasets for Offline RL (ReDOR) is proposed, inspired by orthogonal matching pursuit (OMP). Finally, the method is compared with several baseline methods on D4RL data.

优点

Originality

  • Such new method is proposed to select a coreset from the raw offline dataset, which could contribute as an alternative approach in offline RL.

Clarity

  • Several informative figures are provided. Especially the one by t-SNE provides a straightforward way to understand the behaviour of such selection process.

Significance

  • In some of the settings concerned in the experiments, such method is quite efficient.

Weaknesses

Quality

  • Several assumptions in Theorem 4.1 are rather stronger than scenarios in actual implementations. One observation often seen in offline RL is diverging gradients (without proper training techniques), which, however, are assumed in the paper to be uniformly bounded w.r.t. the parameters of the policies and Q-functions, respectively.
  • Despite the multi-round selection strategy introduced in Section 4.2, as long as the empirical returns are used, as depicted in equation (13), the targets in training steps are relatively fixed (in the sense of distributions due to behaviour policies), which then makes (13) no longer an approximation of Bellman backup errors. As a result, it is currently not clear if such approach would lead to a guaranteed good estimation of values/Q-functions.
  • According to what the reviewer can understand about the statements and proof for results in Section 5, the theorems only consider the proposed method defined with classic TD loss, while do not consider the techniques emphasized in Section 4.2 - 4.3. As a result, such theoretical discussion is not an actual analysis of the proposed algorithm (feel free to correct me).
  • In Line 766, within the proof of Theorem 5.2, it is not justified why $S^k$ can always start from the cluster center $c_k$ of the gradients.

Clarity

  • According to the way a Q-function is defined in Line 99, some index $t$ should be included in the notation of $Q$.
  • The horizon $H$ is not explicitly defined.
  • There is not enough information about $L_{\max}$.
  • There is no introduction to how KRLS, Log-Det and BlockGreedy are implemented in this offline RL setting.

Significance

  • As explained in the 'Quality' part, the theoretical results seem not to be exactly for the proposed method.

Questions

None

Comment

Dear Reviewer,

Thanks for your valuable and detailed comments. We hope the following statements clear up your concerns.

W1: The assumptions in Theorem 4.1 are rather stronger than scenarios in actual implementations.

A for W1: We thank the reviewer for raising the point. If the gradients of the algorithm diverge in practice, it would contradict our assumptions, and the selected data subset would no longer be valuable. However, there are currently various empirical techniques [1,2,3] that can overcome this issue, ensuring that the algorithm's gradients remain stable.

W2: Whether such an approach would lead to a guaranteed good estimation of values/Q-functions.

A for W2: In practice, we found that if we use the standard Bellman backup errors, the gradients used for data selection can be unstable. On the other hand, if we use the relatively fixed target, it cannot lead to a good estimation of the Q-functions. To address this issue, we first train $Q_\theta$ and $\pi_\phi$ on the full dataset based on the standard Bellman backup errors before starting the subset selection. Then, we load the pre-trained parameters $\theta, \phi$ to select the data subset based on Equation 13, which ensures relatively accurate and stable gradients.

W3: The theorems only consider the proposed method defined with classic TD loss.

A for W3: The theoretical analysis in Section 5 provides conclusions when the gradient approximation error is $\epsilon$. On the other hand, the techniques in Sections 4.2-4.3 guarantee that the gradient approximation error is lower than $\epsilon$. Therefore, the theoretical conclusions in Section 5 can be applied to our algorithm.

W4: Why $S^k$ can always start from the cluster center of gradients.

A for W4: This is an assumption in our theoretical analysis. However, this assumption is not difficult to satisfy in practice: we can first cluster the dataset and use the cluster centers as the initial points before selecting the data subset within each cluster.

C1: Some index $t$ should be included in the notation of $Q$.

A for C1: Thanks for your suggestions, we have corrected it in the revised version.

C2: The horizon $H$ is not explicitly defined.

A for C2: Thanks for your suggestions, we have corrected it in the revised version.

C3: There is not enough information about $L_{\max}$.

A for C3: Since $L_{\max}$ is a constant, maximizing $F_\lambda(S)$ is equivalent to minimizing $\mathrm{Err}_\lambda(w, S, L, \theta)$. Therefore, we adopt OMP to directly minimize $\mathrm{Err}_\lambda(w, S, L, \theta)$ and do not need to set $L_{\max}$.

C4: There is no introduction to how KRLS, Log-Det and BlockGreedy are implemented in this offline RL setting.

A for C4: We implement KRLS, Log-Det and BlockGreedy based on the standard Bellman backup errors, and the specific data selection process remains consistent with the original papers.

We sincerely thank the reviewer again for the timely and valuable comments. We hope that our response has cleared most of your concerns.

[1] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.

[2] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, 34:20132–20145, 2021.

[3] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021.

Comment

Dear Reviewer,

We have added additional explanations for our methods. We are wondering if our response and revision have cleared your concerns. We would appreciate it if you could kindly let us know whether you have any other questions. We are looking forward to comments that can further improve our current manuscript. Thanks!

Best regards,

The Authors

Comment

Thanks for the responses! Please see my follow-ups:

If the gradients of the algorithm diverge in practice, it would contradict our assumptions, and the selected data subset would no longer be valuable. However, there are currently various empirical techniques [1,2,3] that can overcome this issue, ensuring that the algorithm's gradients remain stable.

  • First of all, it has not been clearly explained how such boundedness is guaranteed in the currently proposed method. In addition, either empirically or theoretically, it would be better if one could verify that those boundedness assumptions are satisfied. Otherwise such strong assumptions may restrict the capabilities of the method.

A for W2

  • So it means the target is actually biased? (not following the Bellman operator)?

  • If some pre-training on the whole dataset is needed, does the advantage of acceleration still hold?

the techniques in Sections 4.2-4.3 can guarantee the gradient approximation error is lower than $\epsilon$

  • I see the techniques, but not the justification. For example, both theorems explicitly use the term 'TD loss', so why is it the case that Equation (14) is not the TD loss?
Comment

Dear Reviewer,

Thanks for your reply! We will address your follow-up questions below.

Q1: The boundedness of gradients in offline RL is not guaranteed.

A for Q1: We agree with the Reviewer's comment. Current offline RL methods can only ensure that the Q-values do not diverge. Although this can, to some extent, indicate that the gradients of the Q-network have not diverged, there is no rigorous proof that bounds on the gradients can be guaranteed. We appreciate the Reviewer pointing out this issue, and we have added a subsection in the revised paper (l316-l323) that takes your suggestion as a limitation of the theoretical analysis, thereby providing valuable insights for future research.

Q2.1 and Q3: The target is actually biased, and why is Equation (14) not the TD loss?

A for Q2.1 and Q3: Compared to updates in standard RL methods, the targets we use are indeed biased. However, the reason we use Equation 14 instead of the TD loss is to provide a more consistent learning signal and mitigate the instability caused by changing target values, thereby making the selected data more valuable. We thank you again for your valuable comments, and we have highlighted this point in the discussion of limitations in our revised paper (l316-l323). Moreover, we conduct an additional ablation study by replacing the empirical returns module in ReDOR with the standard TD loss (ReDOR (TD Loss)). The experimental results in Table 1 show that the empirical returns module is necessary.

| D4RL (Hard) | ReDOR (TD Loss) | ReDOR |
|---|---|---|
| walker2d-medium-v0 | 20.4±3.1 | 80.5±2.9 |
| halfcheetah-medium-v0 | 38.4±0.3 | 41.0±0.2 |
| hopper-medium-v0 | 42.1±2.3 | 94.3±4.6 |
| walker2d-medium-replay-v0 | 11.4±1.2 | 21.1±1.8 |
| halfcheetah-medium-replay-v0 | 20.2±0.3 | 41.1±0.4 |
| hopper-medium-replay-v0 | 13.0±1.1 | 35.3±3.2 |
| walker2d-expert-v0 | 3.4±0.2 | 104.6±2.5 |
| halfcheetah-expert-v0 | 80.2±0.5 | 88.5±2.4 |
| hopper-expert-v0 | 90.3±5.4 | 110.0±0.5 |

Table 1. Ablation about the empirical returns module in the D4RL (Hard) tasks.
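To make the contrast in this ablation concrete, the sketch below writes the two regression targets side by side; the batch fields (including a precomputed discounted return-to-go) and the network interfaces are illustrative assumptions, not the exact losses in the released code.

```python
import torch
import torch.nn.functional as F

def mc_return_loss(q_net, batch):
    """Regress Q(s, a) toward the empirical return-to-go stored with each sample (fixed target)."""
    q = q_net(batch["obs"], batch["act"]).squeeze(-1)
    return F.mse_loss(q, batch["return_to_go"])

def td_loss(q_net, target_q_net, policy, batch, gamma=0.99):
    """Standard bootstrapped TD target; the target moves as the networks change."""
    with torch.no_grad():
        next_q = target_q_net(batch["next_obs"], policy(batch["next_obs"])).squeeze(-1)
        target = batch["rew"] + gamma * (1.0 - batch["done"]) * next_q
    q = q_net(batch["obs"], batch["act"]).squeeze(-1)
    return F.mse_loss(q, target)
```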

Q2.2: If some pre-training on the whole dataset is needed, does the advantage of acceleration still hold?

A for Q2.2: Although pre-training on the whole dataset is required, whenever we need to retrain the algorithm (e.g., for parameter tuning), training on the reduced dataset significantly reduces the training time.

Best,

The Authors

Comment

Dear Reviewer,

Thanks for your reply! We hope the following statements can clear up your remaining concerns.

Boundedness issue of gradients.

A: Thanks again for your suggestions. We also appreciate the valuable comments, which have helped us significantly strengthen the paper.

Using TD loss or not.

A: Yes. In the TD loss version, the same pre-training as in the primary procedure is also used; only the loss is replaced with the TD loss.

Training cost.

A: As suggested, we conduct additional experiments on the computational cost of the different approaches (with or without pre-training). Specifically, we train TD3+BC and IQL on the reduced dataset generated by ReDOR (named TD3+BC (Reduce) and IQL (Reduce)). On the other hand, we train TD3+BC and IQL on the standard datasets (named TD3+BC (original) and IQL (original)). We record the time the algorithms take to reach the same performance. The experimental results in Table 1 show that the computational cost of training on the reduced dataset is significantly lower than on the original dataset.

Note that the results for IQL are generally faster than those for TD3+BC because we use the official IQL code, which is implemented in JAX and faster than PyTorch. All experiments are conducted on a GeForce RTX 3090 GPU. In the revised version, we will include an empirical comparison among more approaches.

| D4RL | TD3+BC (Reduce) | TD3+BC (original) | IQL (Reduce) | IQL (original) |
|---|---|---|---|---|
| walker2d-medium-v0 | 47m | 81m | 23m | 38m |
| halfcheetah-medium-v0 | 46m | 81m | 21m | 35m |
| hopper-medium-v0 | 33m | 67m | 22m | 36m |
| walker2d-medium-replay-v0 | 44m | 78m | 21m | 34m |
| halfcheetah-medium-replay-v0 | 27m | 49m | 23m | 37m |
| hopper-medium-replay-v0 | 25m | 45m | 21m | 35m |
| walker2d-expert-v0 | 49m | 87m | 20m | 32m |
| halfcheetah-expert-v0 | 56m | 93m | 26m | 41m |
| hopper-expert-v0 | 14m | 24m | 14m | 20m |

Table 1. Computational cost of the different approaches (with or without pre-training); m denotes minutes.

Best,

The Authors

Comment

Thanks for the details.

Boundedness issue of gradients

  • I appreciate that the authors explicitly clarify such issue as a limitation in the corresponding part of the paper.

Using TD loss or not

  • The attached result looks interesting, roughly showing that consistency in such a setting may not always be ideal. Just to confirm: when adopting the TD loss version, i.e. the first column in the table, was some pre-training similar to the primary procedure also used?

Training cost

  • It's ok not to have such results currently, but it could help a later revision if an empirical comparison between the different approaches (with or without pre-training) is provided.
Comment

Dear Reviewer,

Thank you for your thoughtful feedback on our paper. With only two days remaining in the discussion period, we kindly ask that you review our responses to ensure we have fully addressed your concerns. If you find our responses satisfactory, we would greatly appreciate it if you could reconsider your rating/scoring.

Your engagement and constructive input have been invaluable, and we truly appreciate your time and effort in supporting this process.

Best regards,

Authors

Comment

The reviewer appreciates the discussion and clarification. As highlighted in the comments, the current theoretical analysis relies on relatively strong assumptions, and the results do not yet fully align with the complete set of techniques proposed. With further investigation and refinement, the paper has the potential to provide deeper insights into these problem settings.

AC Meta-Review

The paper discusses subset selection for offline RL datasets to improve the performance of RL methods. The reviewers think that the paper is well written and the idea is easy to follow. The paper presents both theory and experiments for the proposed method.

In terms of weaknesses, there seems to be a gap between the proposed theory and the actual experiments, in the sense that the theoretical results seem not to be for the proposed method. The baselines used for comparison are also a bit outdated.

Additional Comments from the Reviewer Discussion

Some of the reviewers' concerns were addressed during the rebuttal.

Final Decision

Accept (Poster)