Federated Ensemble-Directed Offline Reinforcement Learning
A novel federated offline reinforcement learning algorithm
Abstract
Reviews and Discussion
This paper proposes the Federated Ensemble-Directed Offline Reinforcement Learning Algorithm. The combination of offline RL and federated learning is an interesting way to address the training-data insufficiency that arises from small pre-collected datasets.
Strengths
The originality of this paper is relatively good, since the proposed Federated Ensemble-Directed Offline Reinforcement Learning Algorithm is effective for offline reinforcement learning. The quality and clarity are also good, and the paper is well written. The significance of this paper is clear, because offline reinforcement learning is important in real-world scenarios.
Weaknesses
- Some technical details need to be explained. For example, the ensemble learning and its role.
- The novelty of this paper needs further clarification: what is the main difference between the proposed method and existing studies? It seems that there is only a simple combination of two technologies.
- Numerically, the authors could consider comparing their method with more baselines. There are some studies on federated learning for offline RL.
Questions
- Some technical details need to be explained. For example, the ensemble learning and its role.
- The novelty of this paper needs further clarification: what is the main difference between the proposed method and existing studies? It seems that there is only a simple combination of two technologies.
- Numerically, the authors could consider comparing their method with more baselines. There are some studies on federated learning for offline RL.
Limitations
- What is the technical drawback of the proposed method? E.g., the effectiveness of the agent weighting by the ensemble approach.
- Does this proposed method work for other RL algorithms?
We thank the reviewer for their valuable feedback. We are delighted to know that the reviewer finds our work original, our problem significant, and our paper well written. Below, we address the reviewer's concerns and hope they will consider increasing their score.
1. Some technical details need to be explained. For example, the ensemble learning and its role.
Response: The idea of ensemble learning in our algorithm is to use the data distributed among the clients to collectively learn a federated policy. In Section 4.1, we note that "ensemble heterogeneity" is one of the key challenges in federated offline RL, and we propose an ensemble approach to overcome it. Specifically, our approach uses the performance of each local policy as a proxy to weight that client's contribution to the federated policy, so that policies learned from higher-quality client data have a greater influence on the federated policy. We explain this approach in Section 5.1, including the equation that translates the idea into an actual algorithmic step. In addition to the experimental results in Section 6 that show the superior performance of our FEDORA algorithm, we have included ablation results in Appendix C.1 that show the significance of our ensemble method relative to the other components of FEDORA. Please let us know if any specific aspects need to be elaborated further.
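To make the weighting idea concrete, here is a minimal, illustrative sketch (the function names, the temperature `tau`, and the specific softmax form below are simplifications of our own for this response, not the exact rule in Section 5.1):

```python
import numpy as np

def federate_policies(client_policy_params, client_performance, tau=1.0):
    """Illustrative sketch of performance-weighted policy federation:
    clients whose local data yields better-performing policies contribute
    more to the federated policy. Names and the softmax weighting are
    simplifications, not the exact rule used in the paper."""
    perf = np.asarray(client_performance, dtype=np.float64)
    # Softmax over locally estimated performance: higher performance -> larger weight.
    weights = np.exp((perf - perf.max()) / tau)
    weights /= weights.sum()
    # Weighted average of client policy parameters (assumes a shared policy
    # architecture, so parameters can be averaged coordinate-wise as in FedAvg).
    federated = {}
    for name in client_policy_params[0]:
        federated[name] = sum(w * params[name]
                              for w, params in zip(weights, client_policy_params))
    return federated, weights

# Toy usage: three clients; the second has the highest locally estimated return.
clients = [{"w": np.array([0.1, 0.2])},
           {"w": np.array([0.5, 0.4])},
           {"w": np.array([0.0, 0.3])}]
returns = [10.0, 50.0, 20.0]
federated_policy, client_weights = federate_policies(clients, returns, tau=10.0)
```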
2. The novelty of this paper needs further clarification: what is the main difference between the proposed method and existing studies? It seems that there is only a simple combination of two technologies.
Response: We respectfully disagree with the reviewer's comment that our proposed algorithm is "only a simple combination of two technologies". We emphasize that a simple combination of federated learning and offline RL is insufficient, as we explain in the paper (see Fig. 1 and the accompanying discussion). Significant algorithmic innovations are necessary to overcome the unique challenges of federated offline RL described in Section 4.1. Our contributions include four key innovations: Ensemble-Directed Learning over Client Policies (Section 5.1), Federated Optimism for Critic Training (Section 5.2), Proximal Policy Update for Heterogeneous Data (Section 5.3), and Decaying the Influence of Local Data (Section 5.4). We give detailed experimental evidence of the method's superior performance in Section 6 and the Appendix, and our ablation experiments show the importance of each of these innovations.
3. Numerically, the authors could consider comparing their method with more baselines. There are some studies on federated learning for offline RL.
Response: We have demonstrated the superior performance of our FEDORA algorithm against four different baseline algorithms through simulation experiments; see Section 6.1 and Appendix C. We have also evaluated FEDORA in the real world on a TurtleBot, a two-wheeled differential-drive mobile robot (see Section 6.1), and compared it against the same baseline algorithms. We have also included a video of this real-world demonstration. We sincerely believe that these experiments and the real-world demo clearly show the superior performance of FEDORA against standard baselines. We would appreciate further guidance on specific baselines the reviewer would like us to include in our comparisons.
4. What is the technical drawback of the proposed method? E.g., the effectiveness of the agent weighting by the ensemble approach.
Response: We address our limitations in Appendix E. We assume that all clients share the same MDP model (transition kernel and reward model), and that any statistical variation across the offline datasets is due to differences in the behavior policies used to collect the data. In future work, we aim to broaden this to scenarios where clients have different transition and reward models.
Regarding the second remark about effectiveness, please note that we have already included detailed ablation experiments analyzing the effectiveness of the different components of our algorithm; see Appendices C.1 and C.2.
5. Does this proposed method work for other RL algorithms?
Response: Indeed! The FEDORA framework we propose is general and can work with any actor-critic-based offline RL algorithm. We mention this in the paper; please see lines 126-128.
Thanks to the authors for their detailed response. My concerns have been addressed. I will raise my score to weak accept.
The authors identify fundamental challenges for Federated Offline Reinforcement Learning and present FEDORA, an approach that tackles each of them. They perform extensive evaluation of the approach on MuJoCo and real-world datasets, showing improved performance over existing work.
Strengths
The paper is well-written and, importantly, the code has been shared. The authors run extensive experiments. The work is novel, and the notion of federated optimism is particularly interesting. Federated offline RL is an important research area with vast real-world applicability. The algorithm has been shown to be robust to diverse/heterogeneous client datasets. It is also commendable that the approach was tested on a real-world robot.
Weaknesses
No theoretical guarantees have been given for the algorithm, though it does build upon foundational work. I believe that the authors should explicitly discuss limitations/opportunities for future work in the paper. It is important for the algorithm pseudocode to be included in the main material, as is the norm in such papers. I believe that there are perhaps too many experiments included in the main paper, meaning that the discussion of, and hypotheses for, the results are somewhat diluted. Another minor issue is that the figures are placed very far away from where they are referred to in the text.
Questions
- How far do the authors perceive that this model can be pushed? i.e., the assumption that all clients have the same MDP is restrictive but understandable for a first set of experiments.
- Have any experiments been run using D4RL-random datasets? It would be interesting to see whether this collapses learning. With regard to FEDORA outperforming centralised training, I think a deeper discussion on this would be useful.
- What is the main reason for this? Heterogeneous data, even though previous work has successfully mixed datasets: https://arxiv.org/abs/2106.06860
Limitations
Limitations should be explicitly stated. I feel that the authors could give a more balanced view of the algorithm by not only showing strengths but also assessing the limits of the work.
We thank the reviewer for their comments and are happy to note that they find our work novel, our experiments extensive, and our paper well-written. Below, we address their concerns and hope that they consider increasing their score.
1. No theoretical guarantees have been given for the algorithm though it does build upon foundational work.
Response: Thank you for your comment. Providing a theoretical guarantee for our FEDORA algorithm is indeed a challenging problem that requires a technical analysis of multiple complicated components, including offline policy evaluation, pessimistic estimation, ensemble-style quality-based federation, and the handling of heterogeneous data, as well as a joint analysis of these components to derive the final performance bound.
We, however, emphasize that, to the best of our knowledge, ours is the first paper on federated offline deep RL with an algorithm that performs well across a variety of settings. Our design is analytically driven: it identifies each issue that arises in learning from an ensemble of policies and builds up the algorithm methodically, one step at a time, with each step corresponding to an analytical insight. We sincerely believe that this rigorous, empirically driven approach is valuable for its algorithmic contributions.
2. I believe that the authors should explicitly discuss limitations/ opportunities for future work in the paper.
Response: We discuss the limitations and future directions of our research in Appendix E. For the final submission, we will move this discussion to the main part of the paper to make it more accessible and prominent.
3. It is important for the algorithm pseudocode to be included in the main material as is the norm in such papers... Another minor issue is that the figures are placed very far away from where they are referred to in the text.
Response: We pushed the pseudocode to the appendix due to space constraints. However, for the final version, we will move the pseudocode back to the main text, as we will have an extra page for the final submission. Additionally, we will adjust the placement of figures to be closer to the relevant text.
4. How far do the authors perceive that this model can be pushed? i.e., the assumption that all clients have the same MDP is restrictive but understandable for a first set of experiments.
Response: We plan to extend FEDORA to a meta federated learning setting, wherein we can learn with clients that have different transition and reward functions. This extension is discussed in Appendix E, and we aim to explore this direction in future work.
5. Have any experiments been run using D4RL-random datasets? It would be interesting to see whether this collapses learning. With regards to FEDORA outperforming centralized training I think a deeper discussion on this would be useful.
Response: Yes, we have run experiments using the D4RL random dataset and compared it with centralized training (See Figure 3 in Section 6.1). We also conduct experiments with clients having different datasets (including random datasets) in Appendix C.6.
6. What is the main reason for this? Heterogeneous data, even though previous work has successfully mixed datasets.
Response: The use of heterogeneous datasets in centralized offline RL is a significant challenge. One reason for the drop in performance when pooling data from behavior policies with different expertise levels is that it can exacerbate the distributional shift between the learned policy and the individual datasets, leading to poor performance [1].
References
[1] Tianhe Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Sergey Levine, and Chelsea Finn. Conservative data sharing for multi-task offline reinforcement learning. Advances in Neural Information Processing Systems, 34:11501–11516, 2021.
Thank you for engaging with the review. I think a brief discussion of the limitations should be in the main paper. Please ensure that all other promised changes are made: I think that my comments can be used to move around some important content into the main paper. If the concerns are addressed then I will stick to my original score.
This paper presents the Federated Ensemble-Directed Offline Reinforcement Learning Algorithm (FEDORA), a novel approach for collaborative learning of high-quality control policies in a federated offline reinforcement learning (RL) setting. The paper identifies key challenges in federated offline RL, including ensemble heterogeneity, pessimistic value computation, and data heterogeneity. To address these issues, FEDORA estimates the performance of client policies using only local data and, at each round of federation, produces a weighted combination of the constituent policies that maximizes the overall offline RL objective while maximizing the entropy of the weights. Besides the core idea, FEDORA also performs data pruning.
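As a simplified illustration of this weighting step (notation mine, with an assumed temperature $\tau$; the paper's exact formulation is in its Section 5.1): if the weighted objective is approximated by a performance-weighted sum over clients, then maximizing it together with the entropy of the weights over the probability simplex yields softmax weights in closed form,

$$
\max_{w \in \Delta_N} \; \sum_{i=1}^{N} w_i \,\hat{J}_i + \tau\, \mathcal{H}(w), \qquad \mathcal{H}(w) = -\sum_{i=1}^{N} w_i \log w_i
\;\;\Longrightarrow\;\;
w_i^{\star} = \frac{\exp\!\big(\hat{J}_i/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(\hat{J}_j/\tau\big)},
$$

where $\hat{J}_i$ denotes client $i$'s locally estimated policy performance and $\Delta_N$ is the probability simplex over the $N$ clients.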
Strengths
- This is a novel work proposing the first federated offline RL algorithm in the general case (without assuming linearity). The paper is very well written with clear motivations and detailed discussions on the insufficiency of existing, naive approaches.
- The experiments are also very thorough and convincing, ranging from simple 2D environments to high-dimensional continuous control problems. The algorithm is also tested on a real-world robot platform, which is very impressive given the density of algorithmic contributions in the paper.
Weaknesses
- "Collect wisdom" can be replaced by more rigorous exposition. The same goes for "ambitious targets".
- The number of communication rounds needed for FEDORA to converge is still quite high.
- Given how well the algorithm does, some sort of theoretical analysis could further strengthen the work.
Questions
My questions are stated above.
Limitations
Yes, limitations are adequately discussed.
We thank the reviewer for their positive endorsement of our work. We are happy to know that the reviewer finds our work novel, our experiments extensive and our paper well written. Below we address the concerns of the reviewer.
1. "Collect wisdom" can be replaced by more rigorous exposition. Same goes with "ambitious targets".
Response: We thank the reviewer for their suggestion; we will incorporate this in the final version of the paper.
2. The number of communication rounds needed for FEDORA to converge is still quite high.
Response: The number of communication rounds that FEDORA takes depends on factors such as the complexity of the problem and the number of local epochs performed during each round of federation. We believe that the number of communication rounds can be reduced by increasing the number of local epochs performed in each round of federation.
3. Given how well the algorithm does, some sort of theoretical analysis could further strengthen the work.
Response: Thank you, we are indeed working on this problem. Providing a theoretical guarantee for our FEDORA algorithm is a challenging problem that requires technical analysis of multiple complicated parts corresponding to offline policy evaluation, pessimistic estimation, ensemble-style quality-based federation, dealing with heterogeneous data, and analyzing these components jointly to get the final performance bound.
Joint Response
We would like to express our gratitude to all the reviewers for their time and feedback. We are delighted that the reviewers recognize the novelty of our work (hv6n, EkXX, EGVU), find our paper well-written (hv6n, EkXX, EGVU), and appreciate the comprehensiveness of our experiments (hv6n, EkXX). Below, we provide detailed responses to their queries. We look forward to a productive discussion during the reviewer-author period.
A summary of the strengths and weaknesses, based on the reviews and rebuttal (including the follow-up discussion and the discussion among the reviewers), is provided below:
STRENGTHS
- This paper proposes a novel federated offline RL algorithm in the general case (without assuming linearity).
- The paper is well-written.
- The experiments are extensive and compelling, and include evaluation on a real robot platform.
WEAKNESS
The paper lacks a theoretical performance analysis of the algorithm, which the authors acknowledge would be challenging due to the multiple components involved. However, the reviewers think that the technical contributions of this work (listed under the strengths) outweigh this weakness.
As a minor comment, I would like to suggest that the authors include other references on federated reinforcement learning, especially those published in the three flagship ML conferences.
The authors are encouraged to revise their main paper based on the above feedback and that of the reviewers, as well as based on their rebuttal (e.g., limitations).