Making Offline RL Online: Collaborative World Models for Offline Visual Reinforcement Learning
Abstract
Reviews and Discussion
This paper introduces Collaborative World Models (CoWorld), a novel approach to offline reinforcement learning (RL) with visual inputs. CoWorld conceptualizes offline RL as an online-to-offline transfer learning problem, utilizing an auxiliary online simulator to address the challenges of overfitting in representation learning and value function overestimation. The method involves training separate source and target world models and actor-critic agents, aligning their latent spaces, and using a three-stage learning procedure to improve performance on the target task.
Strengths
- The idea is novel: CoWorld presents an interesting perspective on offline RL by framing it as an online-to-offline transfer learning problem.
- The problem is important: The method aims to tackle both overfitting in representation learning and the trade-off between value function overestimation and over-conservatism.
- Ablation studies and Analysis: The paper includes comprehensive ablation studies that validate the contribution of each component in the CoWorld framework.
Weaknesses
- Dependence on auxiliary online environments: CoWorld heavily relies on the availability of an online environment in a sufficiently similar domain, which may limit its applicability in real-world scenarios.
- Unclear rationale for separate world models: The decision to maintain separate source and target world models, rather than a single jointly trained model, is not adequately explained or supported.
- Marginal performance improvement: The performance gains of CoWorld over simpler baselines (e.g., fine-tuning) are sometimes marginal and highly sensitive to hyperparameters.
- Computational complexity: The alternation between online and offline agents until convergence may be computationally expensive and potentially unstable.
- Strong assumptions: The requirement of a high-quality simulator for a similar source domain contradicts the motivation for offline RL, which assumes online data collection is costly or dangerous.
- Insufficient clarity in the presentation: The necessity and function of individual components—such as world model learning, state alignment, and reward alignment—are not clearly delineated, leading to potential confusion.
Questions
- How dependent is CoWorld's performance on the quality and similarity of the online simulator to the target task?
- Have you considered training on multiple source domains to improve robustness and generalization?
- How does the computational complexity of CoWorld compare to baseline methods in terms of training time and resource requirements?
- Can you provide more insight into the process of selecting an appropriate source domain, and how this might be done without leaking test-time information?
- How would CoWorld perform if the source and target domains have significantly different observation spaces?
Limitations
Yes, the authors have discussed the limitations in the submission.
Q1: The availability of an online environment.
There are many real-world scenarios (such as robot control, autonomous driving, and healthcare) where existing simulators can be employed as source domains. See General Response (Q1) for details.
Further, our method is specifically designed to address cross-domain inconsistencies in dynamics, visual appearances, and action space (Table 1), thereby reducing the necessity for a very similar online environment.
Q2: How dependent is CoWorld on the quality of the online simulator?
In Fig 3 (left) of the original paper, we evaluate our model's performance across each pair of source and target domains. We observe positive transfer effects (indicated by the red blocks) in 18/20 single-domain transfer scenarios. Indeed, we also observe negative transfers (blue blocks) in 2/20 scenarios.
In Lines 159-162 and Appendix C, we present the multi-source CoWorld that can automatically discover a beneficial task from multiple source candidates. In Fig 3 (left), the multi-source CoWorld consistently achieves positive transfer (highlighted by the dashed blue box). The right side of Fig 3 demonstrates the effectiveness of the multi-source CoWorld. It yields results comparable to those of the top-performing single-source CoWorld.
Q3: Why maintain separate source and target world models?
First, using separate world models allows the source and target domains to have significantly different observations. See General Response (Q2, Point(1)) for additional empirical results where the source domain has a low-dim observation space.
Additionally, we compare CoWorld (w/ separate models) and Multi-Task DV2 (w/ one model) in Table 7 in the Rebuttal PDF. The latter involves training on offline and online data with a joint world model and separate actor-critic models. CoWorld consistently performs better in terms of the cross-environment results (Meta-World -> RoboDesk).
Q4: Training on multiple source domains.
As suggested, we design two baseline models trained with multiple source domains and compare them with our proposed multi-source CoWorld:
- DV2-FT (multi-source pre-train), which involves pre-training on DC, WC, and DC* as sources for 40K steps on each task and then finetuning on BP for 300K steps.
- CoWorld (multi-source co-train), which involves co-training with DC, WC, and DC*, as well as the target (BP), achieving a return of 3012.
As shown in Table 8 in the Rebuttal PDF, training on multiple sources can lead to negative transfer. In contrast, CoWorld can identify appropriate auxiliary tasks, enabling it to better manage the potential negative impacts of less relevant source tasks.
Q5: Computational complexity.
We present the results on Meta-World (BP -> HP) in Fig 1 in the Rebuttal PDF. CoWorld achieves convergence (90% of the highest returns) in approximately 14 hours; while it costs DV2 Finetune (DV2-FT) about 13 hours. These results indicate that CoWorld requires a comparable training wall-clock time to DV2-FT, while consistently maintaining better performance in terms of returns after model convergence.
We further illustrate the efficiency of CoWorld from two aspects:
- Convergence time: From the results above, CoWorld and DV2-FT achieve comparable convergence speeds in wall-clock time. Thereafter, our approach consistently outperforms DV2-FT.
- Improving efficiency: In Table 6 in the Rebuttal PDF, we add new experiments demonstrating that efficiency can be further optimized by flexibly reducing the frequency of source training steps during the co-training phase (see the sketch below). The overall performance still surpasses that of DV2-FT (3702 ± 451) and DV2-FT-EWC (528 ± 334).
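For intuition, here is a minimal sketch of throttling the source-domain updates during co-training; the loop structure and names (`update_target_agent`, `update_source_agent`, `source_every`) are ours, not the authors' training code.

```python
def co_train(num_steps, update_target_agent, update_source_agent, source_every=4):
    for step in range(num_steps):
        update_target_agent(step)          # offline target-domain update every step
        if step % source_every == 0:       # online source-domain update less frequently,
            update_source_agent(step)      # reducing the wall-clock cost of source rollouts
```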
CoWorld, like the baseline models, can be trained using a single 3090 GPU.
Q6: More insight into selecting an appropriate source domain.
Recognizing the challenges of manually selecting an appropriate source domain, we propose a multi-source CoWorld approach with an adaptive domain selection method.
First, we typically have some information about the target task. Based on this, we can initially construct a set of source domain candidates (e.g., if the target domain is a robotic arm visual control task, we can randomly select several tasks from Meta-World or other environments as potential source domains), and we allow some of these source domains to differ significantly from the target task.
Next, we propose a method for adaptive source domain selection. This is achieved by measuring the distance of the latent states between the offline target dataset and each source domain provided by different world models. We have included technical details in Appendix C and corresponding results in Fig 3.
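As a rough illustration of this selection rule, the sketch below encodes the target offline data with each candidate's world model and picks the source whose latents are closest to the target's; the names and the specific distance are illustrative placeholders rather than the exact procedure in Appendix C.

```python
import torch

def select_source(target_batch, target_encoder, source_encoders):
    # Latents of the target offline batch under the target world model.
    z_tgt = target_encoder(target_batch)

    def gap(z_a, z_b):
        # Placeholder distance: L2 gap between mean latent features.
        return torch.norm(z_a.mean(dim=0) - z_b.mean(dim=0)).item()

    # Encode the same target batch with each candidate source world model and
    # keep the candidate whose latents are closest to the target latents.
    gaps = {name: gap(enc(target_batch), z_tgt) for name, enc in source_encoders.items()}
    return min(gaps, key=gaps.get)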
Q7: The performance gains over simpler baselines (fine-tuning).
Notably, finetuning requires a very high similarity between the source and target domains, resulting in poor performance in the following three scenarios:
- Cross-environment experiments (Fig 4)
- Adding-noise experiments (Table 6)
- Scenarios with sparse rewards in the source domain (General Response Q2, Point(2)).
As shown in Table 9 in the Rebuttal PDF, CoWorld demonstrates significant improvement over the fine-tuning method, particularly in cases where there are more pronounced domain discrepancies.
Furthermore, another disadvantage of the finetuning method is that it cannot be applied in scenarios where source and target domains have observation spaces of different dimensions. See General Response (Q2, Point(1)).
Q8: What if the source and target domains have significantly different observation spaces?
We have conducted new experiments using a source task with low-dimensional states (Meta-World) and a target domain characterized by high-dimensional images (RoboDesk). Please refer to General Response (Q2, Point(1)) for the results.
Q9: Clarity in presentation.
We hope Table 10 in the Rebuttal PDF can make the paper easier to understand.
Thanks to the authors for the detailed response.
It is still a bit unclear to me why "fine-tuning requires a very high similarity between the source and target domains" but the proposed method does not.
Thank you for your question.
First, fine-tuning the world models requires that the observation and action spaces remain consistent between the source and target domains. Here are two cases where it may not be directly applicable:
- Observation Mismatch (as discussed in the rebuttal): The source domain observations consist of low-dimensional data, while the target domain observations are high-dimensional images. CoWorld improves Offline DV2 by 13.3% and 34.0%, while the fine-tuning method cannot be applied directly in this setting. Additionally, even with the same observation dimensions, variations in viewpoints or image appearance can impact fine-tuning results, as demonstrated in the Meta-World -> RoboDesk experiments.
- Action Mismatch (as presented in Section 4.3 of the paper): In cross-environment experiments, it is common for the source and target domains to have different action spaces. Although we can pad the action inputs of the world model for dimensional alignment, it remains challenging for the fine-tuning method to establish correspondence on each action channel (considering that different environments may have different physical meanings for each action channel).
Our method, however, employs separate world models and aligns them in the latent state space, which allows us to handle differences in both observation and action spaces.
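A minimal sketch of this point, with illustrative module names and sizes (not the authors' code): each domain keeps its own encoder (and its own actor-critic), so the observation and action spaces never need to match, and only the latent distributions are compared by the alignment term.

```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    def __init__(self, obs_dim, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ELU(),
            nn.Linear(256, 2 * latent_dim),  # mean and log-std of a Gaussian latent
        )

    def forward(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp())

source_enc = LatentEncoder(obs_dim=39)    # e.g., low-dimensional proprioceptive states
target_enc = LatentEncoder(obs_dim=1024)  # e.g., flattened features of image observations

def state_alignment_loss(obs_src, obs_tgt):
    # Compare the two latent distributions (a KL here; the paper's exact
    # divergence may differ), which is agnostic to the raw input spaces.
    return torch.distributions.kl_divergence(
        target_enc(obs_tgt), source_enc(obs_src)
    ).mean()
```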
Second, another limitation of fine-tuning is the risk of "catastrophic forgetting" in the target domain. After multiple training iterations on the target domain, the model may forget prior knowledge learned from the source domain. In contrast, our method avoids this issue through co-training, thus preserving the knowledge from the source domain.
Furthermore, in our rebuttal, we provided results demonstrating that CoWorld can effectively handle notable reward mismatches. When the source domain presents only sparse rewards, our approach achieves an average performance gain of 6.6% compared to the fine-tuning method (see Table 2 in the Rebuttal PDF). This improvement is partly attributed to our reward alignment method.
If you have further questions, feel free to let us know.
Thank you for the explanation. I have updated my rating.
This paper studies model-based offline reinforcement learning with visual observations. The work focuses on the overfitting problem in state representation and the overestimation bias of the value function, and introduces an environment model from related tasks to form an offline-online-offline solution. The proposed solution contains state alignment, reward alignment, and value alignment. The empirical studies conduct experiments on transfer across heterogeneous tasks.
Strengths
- The proposed setting is very interesting to me. The author provides a new perspective on reinforcement learning cross-task transfer from the angle of transfer learning, which is challenging.
- The proposed solution is reasonable. When online interaction in the current task is unavailable, leveraging online interaction from related tasks is similar to traditional transfer learning. Although the offline-online-offline process appears somewhat complex, it indeed provides an idea that may promote related research.
Weaknesses
- The provided results have significant standard deviations. I am not sure whether taking 10 episodes over 3 seeds provides a sufficient evaluation. This issue can be referred to: [Deep reinforcement learning at the edge of the statistical precipice. NeurIPS’21].
- In fact, I highly appreciate the contributions made by the authors, provided the statistical results of the experiments are reliable (see Weakness 1). This work successfully achieves transfer across multiple environments, which is rare in my experience and makes a significant contribution. But where does the success of the transfer come from? The success of state alignment is relatively easy to understand, as it is similar to traditional visual domain adaptation. But where does the success of reward alignment come from? From information like "objects should be picked" in the reward signal? This seems to impose relatively strict limitations on the relevance of the tasks, which is not discussed in the current version.
- Similarly, what are the conditions for mitigating value overestimation? Figure 2 is well visualized, but it seems to implicitly assume that the source critic and the ideal target critic perform similarly on the broad support. If based on such an assumption, cross-environment transfer does not seem so difficult. I still affirm the contribution of this work, but can the authors provide a formal analysis of this and determine the applicable conditions for the proposed approach?
Questions
- How to ensure that the source critic is beneficial to the target task, and is it possible to have negative transfer when facing conflicting tasks?
- See the weakness 3.
After rebuttal, my concerns have been addressed well. I decide to raise my score to 6.
Limitations
The authors have provided a discussion about the limitations.
Thank you for your constructive comments. Our responses are provided below. If you have any further questions, please feel free to let us know.
Q1: Not sure if taking 10 episodes over 3 seeds is sufficient.
We extended the results from 3 random seeds to 5 random seeds and tested the performance variations with 3, 5, and 10 random seeds, as well as 10, 20, 50, and 100 episodes. The experiments demonstrate that CoWorld consistently outperforms the baseline model DV2 Finetune (DV2-FT) in terms of performance stability across various tasks and different numbers of random seeds. Please refer to General Response (Q3) for more details.
Q2: The source of success in transfer, particularly in reward alignment, is unclear and seems to impose strict limitations on task relevance, which is not sufficiently discussed.
(1) On the reasoning behind 'Online-to-Offline Reward Alignment'
First, let us review the training details in the reward alignment process:
- Step 1: We sample from the target offline dataset;
- Step 2: In Eq. (4), we use the source reward predictor to predict rewards conditioned on the target inputs;
- Step 3 (Important!): Also in Eq. (4), we mix this prediction with the true target reward to generate a mixed reward signal;
- Step 4 (Important!): In Eq. (5), we use the mixed reward to supervise the training of the source reward predictor conditioned on the target inputs;
- Step 5: Also in Eq. (5), we retain the training term that fits the source reward predictor, conditioned on the source inputs, to the raw source reward data.
Clearly, due to the aforementioned Steps 3-4, the source reward model learns to approximate a mixture of its own prediction and the true target reward, conditioned on inputs drawn from the same target tuple. In this way, the reward alignment process integrates target information into the learned reward function of the source domain. Consequently, it enables the subsequent learning process of the source critic to be informed with rich target-domain knowledge.
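To make Steps 2-5 concrete, here is a minimal code sketch of the mixed-reward supervision, assuming hypothetical names (`source_reward_head`, `z_tgt`, `r_tgt`, `z_src`, `r_src`, and the mixing coefficient `beta`); it is a sketch of the idea, not the authors' implementation of Eqs. (4)-(5).

```python
import torch
import torch.nn.functional as F

def reward_alignment_loss(source_reward_head, z_tgt, r_tgt, z_src, r_src, beta=0.5):
    # Eq. (4) (sketch): predict rewards for *target* latents with the *source*
    # reward head, then mix the prediction with the true target reward.
    with torch.no_grad():
        r_mix = beta * source_reward_head(z_tgt) + (1.0 - beta) * r_tgt
    # Eq. (5) (sketch): supervise the source reward head with the mixed reward on
    # target inputs, and keep the usual regression term on raw source rewards.
    loss_on_target = F.mse_loss(source_reward_head(z_tgt), r_mix)
    loss_on_source = F.mse_loss(source_reward_head(z_src), r_src)
    return loss_on_source + loss_on_target
```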
(2) Can CoWorld benefit from a source domain with a significantly different reward signal?
Yes, CoWorld can benefit from a source domain with a significantly different reward signal. In our experiments, the source rewards are sparse and the target rewards are dense. The experiment results demonstrate that although excessively sparse rewards can hinder the training process, CoWorld still outperforms DV2-FT with a more balanced reward structure. For detailed results and analysis, please refer to General Response (Q2, Point(2)).
Q3: What are the conditions for mitigating value overestimation?
Good question! Regarding the conditions for mitigating value overestimation, Figure 2 illustrates our approach effectively. However, it does indeed assume that the source critic and the ideal target critic perform similarly across a broad range of states. This assumption can be softened by our proposed approaches, namely, 1) state alignment and 2) reward alignment. Through these approaches, we aim to mitigate domain discrepancies between distinct source and target MDPs.
According to Eq. (9) in Appendix A.4, the source critic is trained to estimate the expected discounted future rewards $\mathbb{E}\big[\sum_{\tau \ge t} \gamma^{\tau-t} \cdot \hat{r}_\tau\big]$ given the source latent state at timestamp $t$. There are two critical points here:
- For the training supervision, during the imagination phase, the source critic is optimized ONLY based on the n-step predicted rewards (Algorithm 1, Lines 22-24), rather than the ground-truth source rewards. As previously demonstrated, the predicted rewards contain a mix of information from both the source and target domains.
- At the input end, the source critic is indeed trained on the imagined source latents. However, due to the state alignment process, the source latents share close distributions with the latents conditioned on target inputs.
Thus, it is not entirely accurate to suggest that the source critic is trained to 'maximize the expected sum of source rewards in a range of source latents.' In fact, it learns to maximize the expected sum of target-informed rewards across latents that are closely distributed to the target latents. Consequently, the source critic can be effectively applied to the target domain to guide the training of the target agent, using the target latents as inputs.
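For concreteness, here is a hedged sketch of what such a critic objective looks like when written with a standard DreamerV2-style λ-return; the notation is ours and may differ from the paper's Eq. (9):

$$
V_t^{\lambda} = \hat{r}_t + \gamma \Big[(1-\lambda)\, v_\psi^{\mathcal{S}}(\hat{z}_{t+1}) + \lambda\, V_{t+1}^{\lambda}\Big],
\qquad
\mathcal{L}(\psi) = \mathbb{E}\Big[\tfrac{1}{2}\big(v_\psi^{\mathcal{S}}(\hat{z}_t) - \operatorname{sg}(V_t^{\lambda})\big)^2\Big],
$$

where $\hat{r}_t$ is the target-informed reward prediction from the aligned reward model, $\hat{z}_t$ are imagined source latents that, after state alignment, stay close to the target latent distribution, and $\operatorname{sg}(\cdot)$ denotes stop-gradient.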
Q4: How to ensure that the source critic is beneficial to the target task, and is it possible to have negative transfer when facing conflicting tasks?
In Figure 3 (left) of the original manuscript, we evaluate our model's performance across each pair of source and target domains. We observe positive transfer effects (indicated by the red blocks) in most cases (18/20 single-domain transfer scenarios). And indeed, we also observe negative transfers (indicated by the blue blocks) in 2/20 scenarios (e.g., DC -> BP).
To ensure an effective source critic, as described in Lines 159-162, we first manually select several candidates for the source domain that are potentially relevant to the target task. We then propose a method for multi-source training, with technical details provided in Appendix C. By introducing the multi-source CoWorld, we can automatically discover a beneficial task from multiple source candidates. As shown in Figure 3 (left), the multi-source CoWorld consistently achieves positive transfer (indicated by the results inside the dashed blue box).
On the right side of Figure 3, we further demonstrate the effectiveness of the multi-source domain selection method. It achieves results comparable to those of the top-performing single-source CoWorld.
Dear Reviewer Bpkq,
Thank you once again for your time in reviewing our paper. We have made every effort to improve the overall clarity in the rebuttal. We would appreciate it if you could check our feedback and see if any concerns remain.
Best regards,
Authors
Thanks for your response, which addressed my concerns. I will raise my score to 6.
This paper studies CoWorld, a model-based RL method that learns the policy through an offline-online-offline framework. Specifically, CoWorld assumes that, in addition to the offline dataset, there is another (simulated) environment available for online interaction. From the offline dataset and the interactive environment, CoWorld constructs two world models to train separate policies. The paper then introduces three components to enhance policy performance: offline-to-online state alignment, online-to-offline reward alignment, and a min-max value constraint. The proposed CoWorld and several baselines are evaluated on the Meta-World, RoboDesk, and DMC benchmarks. From the results presented, the paper suggests that CoWorld outperforms existing RL methods by large margins.
Strengths
- This paper proposes an interesting method for training an RL policy by co-learning from both the offline dataset and the interactive environment.
- This work presents a reasonable number of ablation studies and experiments to verify the proposed method. These experiments cover various directions, including environmental settings, component contributions, and hyperparameter sensitivity.
- This paper is generally well-written and easy to follow. The depicted figures and tables are well-articulated.
Weaknesses
- While I agree the idea of CoWorld is interesting, I feel that it may not be easy to find suitable practical applications. After all, a low-cost, high-efficiency interactive environment is itself a strong assumption. It would be more convincing if the author could describe practical situations where the proposed method is applicable.
- Related to the first point, CoWorld learns separate world models from the offline dataset and the online environment. I'm curious how significant the difference between these two sources can be. The settings used in this paper differ only slightly (4 vs. 5 dimensions in the action space, or a scale change in the reward space). But when the two sources are significantly different, can CoWorld still work? For example, (1) one environment with dense rewards and another with sparse rewards (not just a scale change), or (2) the observation and action spaces of the two sources are obviously different.
- From Table 2, I am concerned about the claim that the proposed CoWorld significantly outperforms other baselines because (1) in several cases, the CoWorld's performance overlaps with other methods when considering its standard deviations, and (2) the results are obtained from only three random seeds. I encourage the authors to conduct more experimental runs and/or provide statistical analysis to eliminate this concern.
- A minor point: while this paper is generally well-written, I found some typos that should be easily fixed:
- The notation T represents both the target environment and the maximum number of time steps. (line 14 in Algorithm 1)
- Target/Offline (S) should be T? (Table 4, page 13)
- ... on DMC Meidum-Expert Dataset Medium (line 481, page 14)
- ... as illustrated in Table 1 should be Table 6? (line 516, page 17)
Based on the observed strengths and weaknesses, I will set my initial recommendation as "Borderline accept".
I will be happy to raise my rating if the authors appropriately address my concerns or point out my misunderstandings.
Questions
Please consider addressing my concerns and questions in Weaknesses; I will list some key ones:
- In what practical scenarios is the proposed CoWorld method applicable? Can you provide more examples and details?
- How does CoWorld operate when the two world model architectures have greater differences? Can it still effectively transfer information between these two sources?
- How to prove that CoWorld is really significantly better than other methods? Can you provide the results of more experimental runs or statistical analysis?
Limitations
A major limitation is that the proposed CoWorld method is applicable to only a limited number of practical application scenarios. Other limitations are adequately discussed in the Limitation Section.
We appreciate your great efforts in reviewing our paper and hope that the following responses can address all of your concerns.
Q1: Practical situations where the proposed method is applicable.
Good question! CoWorld can be applied in various real-world scenarios such as robot control, autonomous driving, healthcare, advertising bidding, and power systems. For instance, it can leverage established simulators like Meta-World and CARLA for robotics and autonomous driving, or use medical simulators to evaluate clinical treatments in healthcare. Similarly, it can enhance advertising systems through simulated bidding environments and improve power system operations using electrical system simulators. Please refer to our General Response (Q1) for further details.
Q2: How does CoWorld operate when the two world model architectures have greater differences? Can it still effectively transfer information between these two sources?
To investigate the effectiveness of CoWorld under significantly different world model architectures, we conducted extended experiments in which the source domain observations are low-dimensional states and the target domain observations are high-dimensional images. CoWorld achieves competitive performance against strong baselines. Please refer to General Response (Q2, Point(1)) for more details.
Q3: How to prove that CoWorld is really significantly better than other methods? Can you provide the results of more experimental runs or statistical analysis?
We extended the results from 3 random seeds to 5 random seeds and tested the performance variations with 3, 5, and 10 random seeds, as well as 10, 20, 50, and 100 episodes. The experiments demonstrate that CoWorld consistently outperforms the baseline model DV2 Finetune (DV2-FT) in terms of performance stability across various tasks and different numbers of random seeds. Please refer to General Response (Q3) for more detail.
Q4: Line 14 in Algorithm 1: The notation T represents both the target environment and the maximum number of time steps.
Sorry for the misleading notation. In the revised paper, we will use a different symbol for the maximum number of time steps so that T refers only to the target environment.
Q5: Typos in Table 4 (Page 13) and Line 481 (Page 14).
We will correct these typos in the revised paper.
Q6: Line 516 (Page 17): ...as illustrated in Table 1 should be Table 6?
In Line 516, we wanted to emphasize that the cross-environment setup from Table 1 is even more challenging than the adding-noise setup shown in Table 6, due to more significant domain gaps. Corresponding results for the cross-environment setup are presented in Section 4.3 of the main text.
I appreciate the authors' efforts in addressing my concerns. My issues regarding the differences between the two environments, performance stability, and minor writing problems have been well-resolved.
Additionally, the authors have provided several practical scenarios where CoWorld may be applicable. However, I still believe there may be inevitable obstacles when it comes to actual implementation. For example, the physical simulation capabilities of current off-the-shelf simulators still have significant room for improvement. As a result, we may struggle to construct simulated environments that are realistic enough for our needs.
Nevertheless, I will increase my score to 6 to acknowledge the efforts the authors have made in their rebuttal.
We appreciate the reviewer's prompt feedback.
Admittedly, existing simulation environments may not adequately address the diverse challenges of real-world applications, creating obstacles for the immediate application of our work. This is precisely why we have included extensive experiments—by incorporating action noise, using different observation dimensions, and employing simulation environments with sparse rewards—we have shown that our model can achieve significant performance gains despite substantial domain differences.
Fortunately, the development of increasingly sophisticated RL simulators enhances the potential for the practical use of our approach in the future.
Lastly, we would like to thank the reviewer once again for raising the score.
This work proposes a model-based algorithm for offline visual reinforcement learning using domain transfer learning. The proposed method demonstrates superior performance over existing baselines on several benchmarks.
Strengths
S1. The paper presents the necessity of proposed modules in the algorithms through ablations and validation in multiple setups.
S2. The proposed method outperforms baselines in different benchmarks.
Weaknesses
W1. The cost of training time is higher in comparison to baselines.
Questions
Q1. How does the performance of DV2 Finetune, DV2 Finetune+EWC, and CoWorld look when the x-axis is training wall-clock time instead of the number of training iterations?
Limitations
Yes, the authors have addressed the limitations.
Thank you for your comments. We have conducted new experiments to discuss the training cost of our model. Please refer to the results presented in the Rebuttal PDF. We hope our responses will address your specific concerns regarding training efficiency.
Q1: How does the performance look when the x-axis is training wall-clock time instead of the number of training iterations?
Per the reviewer's request, we present the results on the Meta-World benchmark (Button Press -> Handle Press) in Figure 1 in the Rebuttal PDF. As shown, CoWorld achieves convergence (90% of the highest returns) in approximately 14 hours; while it costs DV2 Finetune (DV2-FT) about 13 hours. These results indicate that CoWorld requires a comparable training wall-clock time to DV2-FT, while consistently maintaining better performance in terms of returns after model convergence.
We will include these results in the revised paper.
Q2: The cost of training time is higher in comparison to baselines.
We further illustrate the efficiency of CoWorld from two aspects:
- Convergence time: From the results above, CoWorld and DV2-FT achieve comparable convergence speeds in wall-clock time. Thereafter, our approach consistently outperforms DV2-FT.
- Improving efficiency: In Table 6 in the Rebuttal PDF, we add new experiments demonstrating that efficiency can be further optimized by flexibly reducing the frequency of source training steps during the co-training phase. The overall performance still surpasses that of DV2-FT (3702 ± 451) and DV2-FT-EWC (528 ± 334).
I thank the authors for their response. I have carefully read the rebuttals and decided to keep my current rating.
General Responses to All Reviewers
In addition to the specific responses below, we here reply to the general questions raised by the reviewers.
Q1: Practical situations where the proposed method is applicable.
As demonstrated in Table 1 of our manuscript, our method can be used in scenarios with notable domain gaps in terms of dynamics, visual appearances, and action space. Typical real-world applications include robot control, autonomous driving, healthcare, advertising bidding, and power systems.
- Robotics and autonomous driving: For the first two scenarios, established simulators such as Meta-World and CARLA [1] are readily available, and numerous existing studies have explored the transfer learning problem using these simulators.
- Healthcare applications: Existing RL models in this field [2] are learned on large volumes of offline clinical histories. Additionally, we have medical domain simulators [3-4] designed by experts, which allow us to evaluate the learned clinical treatments. Exploring how to use these simulators to improve RL therapies in real-world data would be a promising research avenue.
- Advertising bidding: Since direct online interactions with real advertising systems are impractical, recent work has constructed a simulated bidding environment based on historical bidding logs to facilitate interactive training [5]. In this context, the next research focus may be on how to effectively use such simulators to enhance real-world advertising systems.
- Power system: In the realm of power systems, we can use the power system simulator as the source domain, with real power system data serving as the target domain. Recent RL-related studies in power systems [6-7] demonstrate the effectiveness of training RL controllers on high-fidelity simulations or offline data before deploying them on real-world systems, such as service restoration in microgrids and demand response in electric grids.
In summary, there are many real-world scenarios where existing simulators can be employed to improve offline RL policies. As more and more simulators are developed, the research on online-to-offline RL transfer will become increasingly practical.
References:
[1] Dosovitskiy et al. "CARLA: An Open Urban Driving Simulator", CoRL, 2017.
[2] Zhang et al. "Continuous-Time Decision Transformer for Healthcare Applications", AISTATS, 2023.
[3] Hua et al. "Personalized Dynamic Treatment Regimes in Continuous Time: A Bayesian Approach for Optimizing Clinical Decisions with Timing", Bayesian Analysis, 2022.
[4] Adams et al. "Dynamic Multidrug Therapies for HIV: Optimal and STI Control Approaches", Math Biosci Eng, 2004.
[5] Mou et al. "Sustainable Online Reinforcement Learning for Auto-bidding", NeurIPS, 2022.
[6] Du et al. "Deep reinforcement learning from demonstrations to assist service restoration in islanded microgrids", IEEE Trans. Sustain. Energy, 2022.
[7] Lesage-Landry et al. "Batch reinforcement learning for network-safe demand response in unknown electric grids", Electric Power Systems Research, 2022.
Q2: Dependence of CoWorld on the source domain.
We further elaborate on CoWorld’s dependence on the similarity of the source domain from the perspectives of different observation spaces and reward spaces.
(1) How would CoWorld perform if the source and target domains have significantly different observation spaces?
To investigate the effectiveness of CoWorld under significantly different observation spaces, we conduct experiments on Meta-World -> RoboDesk tasks, where the source domain observations consist of low-dimensional states and the target domain observations are high-dimensional images. As illustrated in Table 1 in the Rebuttal PDF, CoWorld outperforms Offline DV2 by 13.3% and 34.0% due to its ability to effectively leverage the low-dimensional source data.
NOTE: The finetuning baseline method (DV2-FT) is not applicable in this scenario.
(2) Can CoWorld benefit from a source domain with a significantly different reward signal?
In the Meta-World -> RoboDesk setting, we modify the rewards of the source domain (Meta-World) to be sparse in order to investigate the impact of the source task's reward on target training. The reward in the source domain is set to 500 only upon task completion and remains 0 before that. Despite the significant differences in rewards, CoWorld achieves an average performance gain of 6.6% compared to DV2-FT. Please refer to Table 2 in the Rebuttal PDF.
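For reference, a hedged sketch of this sparse-reward modification written as an environment wrapper; the `info.get("success", ...)` check is an assumption about how the underlying Meta-World task reports completion and may differ from the authors' implementation.

```python
import gymnasium as gym

class SparseRewardWrapper(gym.Wrapper):
    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        # Replace the dense reward: 500 only upon task completion, 0 otherwise.
        reward = 500.0 if info.get("success", 0.0) > 0 else 0.0
        return obs, reward, terminated, truncated, info
```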
Q3: The performance stability of CoWorld.
(1) We extend the experiments to 5 random seeds on the four tasks and observe results consistent with those obtained using 3 random seeds. This indicates that using 3 random seeds is sufficient to reveal the differences in average performance between the compared models. Please refer to Table 3 in the Rebuttal PDF.
(2) We compare the performance of CoWorld and DV2 Finetune (DV2-FT) using 3, 5, and 10 random seeds, respectively. The results show that CoWorld achieves stable performance across different numbers of training seeds and consistently outperforms the baseline models. Please refer to Table 4 in the Rebuttal PDF.
(3) To assess the stability of the model during testing, we conduct further experiments running 10, 20, 50, and 100 episodes in the Meta-World Button Press -> Handle Press task. We calculate the mean and standard deviation for each case. The results indicate that testing 10 episodes under each random seed is sufficient to demonstrate the improvement of our model over the baseline model in terms of average performance. Please refer to Table 5 in the Rebuttal PDF.
NOTE: In RoboDesk, OS is short for Open Slide and PB is short for Push Button.
Summary
This paper studies offline reinforcement learning (RL) challenges with visual inputs, specifically focusing on overfitting in representation learning and overestimation bias in future reward predictions. The authors propose CoWorld, a model-based RL approach that integrates offline and online learning by utilizing existing RL simulators for policy testing. CoWorld aims to align state and reward spaces across domains to facilitate effective knowledge transfer between online and offline environments. Experimental results show that CoWorld significantly improves performance compared to existing methods.
Decision
The paper is generally well-written and easy to understand. The depicted figures and tables are well explained. The topic the paper studies is relevant to the NeurIPS community and essential to improving the performance of offline RL algorithms. As the authors noted, these findings can potentially have implications for robotic control, autonomous driving, and healthcare applications. The paper demonstrates competitive results across three RL benchmarks.
The reviewers pointed out several concerns during the rebuttal phase, which the authors addressed successfully. However, I would recommend that the authors incorporate those changes into the camera-ready version of the paper. Most of these changes are minor, like highlighting some of the potential applications and experiments with more seeds.