PaperHub
Overall rating: 6.7/10 (Poster, 3 reviewers; min 6, max 8, std 0.9)
Individual ratings: 8, 6, 6
Confidence: 3.0 | Correctness: 3.0 | Contribution: 3.0 | Presentation: 2.7
ICLR 2025

Cross-Domain Offline Policy Adaptation with Optimal Transport and Dataset Constraint

OpenReview | PDF
Submitted: 2024-09-25 · Updated: 2025-03-06
TL;DR

We propose a simple yet effective method that leverages optimal transport and support constraint for efficient cross-domain offline RL.

Abstract

Keywords
cross-domain, reinforcement learning, offline RL, optimal transport

Reviews and Discussion

Official Review
Rating: 8

This work proposes a new cross-domain offline RL method, OTDF, which leverages data from the source domain to help learning in the target domain. The authors argue that the core challenge of cross-domain offline RL is the dynamics mismatch between the source and target domains, and introduce a threshold and a weight so that only source-domain data close to the target domain are used.

Strengths

  • This work provides an informative proof and a well-designed method. It can also serve as a good baseline for follow-up works.

Weaknesses

$\mu_{t,t'}$ is not defined in Eq. 1.

Questions

Can a more complex kinematic shift be proposed?

Comment

We thank the reviewer for commenting that our work proposes an informative proof and well-designed method and that our method can serve as a good baseline for follow-up works. Please check our clarifications to the concerns below.

Concern 1: $\mu_{t,t'}$ is not defined

Thanks for the question. $\mu\in \mathbb{R}^{n\times n'}$ is the coupling matrix, and $\mu_{t,t'}$ denotes its $t$-th row, $t'$-th column element, $t=1,\ldots,n$, $t'=1,\ldots,n'$, which corresponds to the coupling with respect to $x_t, y_{t'}$. We have made it clearer in the revision.
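
For readers who want a concrete picture of the coupling matrix, below is a minimal NumPy sketch (our illustration, not the authors' implementation) of how $\mu$ arises from entropy-regularized optimal transport between two empirical point sets with uniform marginals; the cost matrix, regularization strength, and iteration count are illustrative choices.

```python
# Toy illustration (not the authors' code) of an entropy-regularized OT coupling matrix
# mu between two empirical point sets with uniform marginals, via Sinkhorn iterations.
import numpy as np

def sinkhorn_coupling(cost, reg=0.05, n_iters=200):
    """Return the coupling matrix mu of shape (n, n') for uniform marginals."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # source marginal
    b = np.full(m, 1.0 / m)          # target marginal
    K = np.exp(-cost / reg)          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]   # mu[t, t'] couples x_t with y_{t'}

x = np.random.randn(8, 4)            # n = 8 "source" points
y = np.random.randn(5, 4)            # n' = 5 "target" points
cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
cost = cost / cost.max()             # normalize for numerical stability
mu = sinkhorn_coupling(cost)
print(mu.shape, mu.sum())            # (8, 5), entries sum to ~1
```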

Concern 2: Can a more complex kinematic shift be proposed?

Yes, it is feasible. The kinematic shift can occur at different parts of the robot.

  • for the halfcheetah task, we modified the rotation angle of the thigh joint on the robot’s back leg in the submission. One can further modify the rotation range of the foot joint to create a kinematic mismatch.
  • for the hopper task, one can modify the rotation range of either the head joint or the foot joint (we already modified the rotation angles of both the head joint and the foot joint in our paper).
  • for the walker2d task, we modified the rotation angle of the foot joint on the robot’s right leg in our submission. One can also modify the rotation range of the thigh joint.
  • for the ant task, we modified the rotation angles of the hip joints of two legs in our original submission. One can also modify the rotation range of the ankle joint.

Note that multiple joints can be broken simultaneously, and the design of the kinematic shift is up to the user (e.g., the shift may involve only one joint or several joints).

To examine the effectiveness of OTDF under possibly more complex kinematic shifts, we consider two domains (walker2d, ant) and modify their XML files. We only evaluate on two domains because we need to train the agent in each modified environment to ensure that it achieves meaningful performance (and adjust the XML files if not), and we also need to gather offline datasets of different qualities; all of this consumes a lot of time. For the walker2d task, we modify the rotation range of the thigh joint in the target domain to be 0.2 times that in the source domain:

<joint axis="0 -1 0" name="thigh_joint" pos="0 0 1.05" range="-30 0" type="hinge"/>
<joint axis="0 -1 0" name="thigh_left_joint" pos="0 0 1.05" range="-30 0" type="hinge"/>

For the ant task, we modify the rotation range of the ankle joints of all legs from [30, 70] to [30, 38]:

<joint axis="-1 1 0" name="ankle_1" pos="0.0 0.0 0.0" range="30 38" type="hinge"/>
<joint axis="1 1 0" name="ankle_2" pos="0.0 0.0 0.0" range="-38 -30" type="hinge"/>
<joint axis="-1 1 0" name="ankle_3" pos="0.0 0.0 0.0" range="-38 -30" type="hinge"/>
<joint axis="1 1 0" name="ankle_4" pos="0.0 0.0 0.0" range="30 38" type="hinge"/>

We first train a SAC agent in the above new kinematic tasks and gather medium, medium-expert, and expert offline datasets following D4RL. All target domain datasets still have a limited budget of 5000 transitions. Due to the limited rebuttal period, we only run OTDF and IQL* (with both domains' data) on some modified kinematic tasks (medium/medium-replay/medium-expert source domain datasets and medium/medium-expert target domain datasets). We use the default hyperparameters of OTDF and summarize the results below. All results are averaged across 5 different random seeds, and we report the normalized scores on the target domain.

| Source | Target | IQL* | OTDF (ours) |
| --- | --- | --- | --- |
| walk-m | medium | 64.7±4.1 | 67.0±0.0 |
| walk-m | medium-expert | 64.6±5.8 | 60.6±11.1 |
| walk-m-r | medium | 51.9±8.1 | 54.3±0.0 |
| walk-m-r | medium-expert | 52.2±12.9 | 51.7±16.0 |
| walk-m-e | medium | 97.6±16.4 | 98.1±16.8 |
| walk-m-e | medium-expert | 95.5±6.6 | 93.7±3.5 |
| ant-m | medium | 59.8±5.7 | 81.0±4.8 |
| ant-m | medium-expert | 43.2±2.0 | 74.6±11.5 |
| ant-m-r | medium | 50.1±4.4 | 78.4±10.4 |
| ant-m-r | medium-expert | 41.9±1.0 | 62.4±6.4 |
| ant-m-e | medium | 73.1±1.6 | 83.4±5.7 |
| ant-m-e | medium-expert | 71.3±2.6 | 80.8±14.3 |
| Total | | 765.8 | 886.0 |

Table 1. Results on new kinematic tasks. walk=walker2d, m=medium, r=replay, e=expert. We report the mean normalized score with the standard deviations.

It can be seen that OTDF still outperforms IQL* on most of the tasks, further demonstrating the effectiveness of our method. One can also modify the kinematic tasks by simulating broken joints at multiple positions (e.g., simultaneously broken ankle joints and hip joints in the ant task) to construct more diverse and complex kinematic shifts and run OTDF on them.

Hopefully, these can resolve the concerns. If there is still something unclear, please let us know!

Comment

Dear Reviewer vhAr, thank you very much for the helpful review and positive rating of our work. It would be great if you could give us some comments on our rebuttal, and kindly check our revision.

Comment

Dear Reviewer vhAr, thank you for your constructive review! We hope that our rebuttal and the revised manuscript can address your concerns. We would appreciate it if you could give us some feedback. Please let us know if there is still anything unclear!

Official Review
Rating: 6

This paper proposes a novel method named Optimal Transport Data Filtering (OTDF) for cross-domain offline policy adaptation. Previous works still need a comparatively large target domain dataset to learn domain classifiers or to filter data via mutual information. In this paper, the authors provide a performance bound to identify the factors behind the performance deviation. This paper proposes to use optimal transport to align data distributions and a regularization term to ensure the learned policy remains within the target domain’s span. The empirical study shows that OTDF outperforms existing methods in cross-domain offline policy adaptation.

Strengths

  • The paper introduces a theoretical analysis of the performance bound in cross-domain offline policy adaptation, providing a clear foundation for the proposed method.
  • The optimal transport-based data filtering method is well motivated and efficiently computes the Wasserstein distance between the source and target domains to filter the source data accordingly.
  • The empirical results across various environments and types of dynamics shifts demonstrate that OTDF significantly outperforms baseline methods based on IQL.

Weaknesses

  • The proposed practical implementation of OTDF still needs to learn a conditional variational auto-encoder to estimate the behavior policy, which may introduce additional computational overhead.
  • The paper argues that the proposed method can work with a small target domain dataset; however, the detailed theoretical and empirical analysis of this claim is still limited.
  • The ablation study on the proposed method is limited, and the comparison with other dynamics-model-based methods or adaptive transfer learning approaches is not comprehensive.

Questions

  • Can you provide more insights on using the small amount of target domain data in the proposed method? Can you provide a more detailed explanation from the theoretical and empirical sides?
  • Can you explain more about the CVAE you used in the proposed method? Why use the CVAE to estimate the behavior policy? How does the CVAE affect the performance of the proposed method?
  • Can you discuss and cite more related works on cross-domain policy adaptation such as [1], [2], and [3], and how the proposed method compares to these methods?
  • Can you explain why you use a target domain dataset with a performance level different from the source domain dataset in the empirical study? For example, why use the medium dataset in the source domain combined with the expert dataset in the target domain? Is this a reasonable setting for cross-domain policy adaptation?
  • Do you have any results for other offline RL backbones like CQL, BCQ, or BEAR? How does the proposed method compare to these methods in cross-domain offline policy adaptation?

[1] When to trust your simulator: Dynamics-aware hybrid offline-and-online reinforcement learning

[2] Off-Dynamics Reinforcement Learning via Domain Adaptation and Reward Augmented Imitation

[3] Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning

Comment

We thank the reviewer for the insightful review. We appreciate that the reviewer acknowledges that our work provides a clear foundation for the proposed method and that OTDF significantly outperforms baseline methods. Please find our clarifications to the concerns below.

Concern 1: on the computational overhead

We clarify that the CVAE is actually pre-trained before the policy learning stage of OTDF (as stated in Line 281). This means that no CVAE training cost is incurred during the training of OTDF. Meanwhile, the cost of training the CVAE is also tolerable since we only train it for 10000 steps (please see Table 5 for the hyperparameter setup). That said, measuring the log probabilities of actions lying in the span of the target domain dataset can still require some extra computational overhead. We respectfully argue that there is no free lunch: many cross-domain offline RL methods suffer from the same issue, e.g., DARA [1] needs to train domain classifiers and construct reward penalties, and IGDF [2] includes a contrastive learning objective for data filtering. These can all introduce additional computational overhead. Based on our experiments, we find that OTDF enjoys a runtime similar to those baseline algorithms.

Concern 2: insights on using the small amount of target domain data in OTDF

The insights behind the small amount of target domain data lie in the following aspects:

  • under many scenarios, target domain data is hard to acquire. It can be expensive to gather data in the target domain (e.g., real-world robotics data), and it is somewhat unrealistic to train policies using a large amount of target domain data (e.g., 1e5 transitions). Nevertheless, a small amount of target domain data (e.g., 5000 transitions) is more acceptable and practicable.
  • human beings are able to quickly adapt to downstream tasks with only a small amount of data. For example, Alice used to play tennis without any exposure to other ball sports (Line 33 of the main text), and she never played badminton before. However, based on her experience in tennis, she can play badminton after a few rounds of play. We believe such ability is also expected in decision-making agents. Hence, we only provide a limited budget of target domain data to realize efficient offline policy adaptation.
  • existing cross-domain offline RL papers still rely on a large amount of data (e.g., DARA/IGDF/BOSA rely on 10% of D4RL data, which can still amount to roughly 1e5 transitions, a comparatively large amount). We note that this can downgrade the necessity of the source domain dataset, since one can already get satisfying performance on such target datasets with existing strong offline RL methods. This can be verified in Table 1 of the BOSA paper [3] and Table 1 of the IGDF paper, where offline RL methods like IQL, CQL, and SPOT achieve quite strong performance with 10% of D4RL data. In contrast, we find that these offline RL methods typically fail under the single-domain setting when only 5000 transitions are available.

We are sorry that a strict and formal theoretical analysis of the small-data regime is difficult. To the best of our knowledge, there are no suitable mathematical tools for connecting the performance bound of the policy and the dataset size in a general manner. There exist some works that try to do so, e.g., [4], but they often rely on assumptions that deviate far from practical applications.

Concern 3: more explanations on the CVAE

We use CVAE to fulfill the dataset constraint, i.e., using CVAE to estimate the behavior policy of the target domain dataset and maximize the log probability of the current policy lying in the span of the target domain dataset. This can intuitively prevent the learned policy from getting biased towards the source domain data, which is vital since we focus exactly on the agent's performance in the target domain. Theoretically, incorporating the dataset constraint term can better control term (b) in Theorem 3.1. Please refer to more details in Lines 216-233 of the main text.

We use the CVAE to model the behavior policy because (a) the CVAE is a very popular generative model with good theoretical foundations, which can generate in-distribution samples; (b) the CVAE is widely adopted and proven effective in the offline RL community for estimating the behavior policy, e.g., in BCQ [5], SPOT [6], PLAS [7], etc.; and (c) using the CVAE yields better performance. It turns out that the CVAE is very important to OTDF. In Figure 3 of the main text, we conduct a parameter study on the policy regularization coefficient $\beta$, which determines the strength of the dataset constraint, i.e., a larger $\beta$ indicates a stronger constraint and vice versa. Setting $\beta=0$ usually leads to a significant performance drop. The impact of $\beta$ can be large on some tasks, indicating that the CVAE is undoubtedly a critical component of OTDF.
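
As a rough illustration of this design (a sketch under our own assumptions about architecture sizes, not the authors' code), a conditional VAE can be pre-trained on target-domain (s, a) pairs and its ELBO used as a lower bound on the behavior log-likelihood, which then serves as the dataset-constraint signal:

```python
# Minimal PyTorch sketch (illustrative sizes) of a CVAE behavior-policy estimator.
# The ELBO on (s, a) lower-bounds log p(a | s); evaluating it at a = pi(s) keeps the
# learned policy within the span of the target-domain dataset.
import torch
import torch.nn as nn

class BehaviorCVAE(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=8, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * latent_dim))
        self.dec = nn.Sequential(nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def elbo(self, s, a):
        mu, log_std = self.enc(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        z = mu + log_std.exp() * torch.randn_like(mu)        # reparameterization trick
        recon = self.dec(torch.cat([s, z], dim=-1))
        recon_ll = -((recon - a) ** 2).sum(dim=-1)           # Gaussian log-likelihood (up to a constant)
        kl = 0.5 * (mu ** 2 + (2 * log_std).exp() - 2 * log_std - 1).sum(dim=-1)
        return recon_ll - kl                                 # lower bound on log p(a | s)

# Pre-train by maximizing cvae.elbo(s, a) on target-domain pairs; afterwards a policy
# regularizer of the form  -beta * cvae.elbo(s, pi(s)).mean()  can be added to the
# actor loss (schematic, not the authors' exact objective).
```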

Comment

Concern 4: Ablation study and baseline comparison are not comprehensive

We respectfully argue that our ablation study is sufficient: we show the effectiveness of data filtering in Figure 2 ($\xi=100$ indicates no data filtering, which usually incurs a performance drop), and the effectiveness of the dataset constraint term in Figure 3 ($\beta=0$ means no dataset constraint term, which often results in a performance drop). We also provide an ablation study on the source domain data weight in Figure 5, which shows its necessity. All of these ablation studies are conducted on 4 tasks (2 kinematic tasks and 2 morphology tasks). Please let us know if the reviewer wants more environments to be covered in the ablation study, and we will add them to the final version.

Our comparison against the baseline methods is also comprehensive. We would like to emphasize that we focus on the cross-domain offline policy adaptation scenario, where both the source domain and the target domain are offline, and no online interactions with either domain are allowed. This means that many off-dynamics RL papers in the literature are not suitable or applicable as baselines (e.g., DARC, H2O [8], DARAIL [9]). We have tried our best to include representative and strong methods for comparison, e.g., DARA and BOSA. Meanwhile, IGDF is a very recent cross-domain offline RL paper published at ICML 2024 that achieves strong performance on numerous dynamics shift scenarios. We hence believe that our selected baselines are strong, relevant, and comprehensive.

Concern 5: discuss and cite more related works on cross-domain policy adaptation

Thanks for recommending these works. We have cited H2O [8] in our original submission. [9, 10] appear to have become available online only after the submission deadline of this venue. We find them quite relevant and have now included them in the revision. H2O [8] realizes policy adaptation by adaptively penalizing the Q-function learning on simulated state-action pairs with large dynamics gaps. It focuses on the setting of online-offline cross-domain RL, i.e., the source domain is online while the target domain is offline. DARAIL [9] first trains the DARC method using both the source domain environment and the target domain environment, and then transfers the policy’s behavior from the source to the target domain through imitation learning from observation. It focuses on the setting where both the source domain and the target domain are online. RADT [10] addresses the off-dynamics scenario from the return-conditioned supervised learning (RCSL) perspective, where it augments the return in the source domain by aligning its distribution with that in the target domain. It targets the cross-domain offline RL scenario.

The differences between OTDF and these methods can be summarized as below:

  • the experimental setting differs. OTDF focuses on offline policy adaptation where both the source domain and the target domain are offline. RADT also studies this setting, while H2O focuses on an online source domain and an offline target domain, and DARAIL requires both an online source domain and an online target domain.
  • DARAIL requires a comparatively large amount of online interactions with the target domain. H2O and RADT both typically require a large amount of target domain offline data (either 100% or 10% of D4RL data, corresponding to at least 100,000 target domain transitions). Instead, OTDF conducts experiments with very limited target domain data (5000 transitions).
  • OTDF leverages optimal transport for data filtering and dataset constraint to avoid the learned policy from getting biased towards the source domain dataset distribution. These techniques are not observed in methods like DARAIL.

Comment

Concern 6: why use a target domain dataset with a performance level different from the source domain dataset in the empirical study

We respectfully argue that this is a reasonable setting for cross-domain policy adaptation because:

  • there is no definite rule that one can only conduct cross-domain offline policy adaptation for source domain offline datasets and target domain datasets with the same quality level. We note that even medium-level source domain dataset and medium-level target domain dataset can have large discrepancies. For example, the medium-level halfcheetah-kinematic dataset has an average return of about 2700 (please check Table 4 of our manuscript for details), while the average return of D4RL halfcheetah-medium-v2 is about 4800, indicating that their behaviors and patterns can be distinct. Using different performance-level target domain datasets from the source domain dataset can better capture the generality and effectiveness of the cross-domain offline RL methods
  • in many real-world applications, we often cannot provide exactly the same performance level offline dataset for both domains. For example, we may gather medium-level offline datasets in version A of a game (i.e., the source domain) with some medium-level RL policy. The company decides to modify some features of this game in the new season and constructs version B of this game (i.e., the target domain). The company hires some expert human players to play the new version of this game and log their play data. The logged human data is of expert quality but limited size. It is infeasible to train a strong AI simply with a limited budget of data. Then, we naturally would leverage medium-level source domain data to realize efficient offline policy adaptation. This example also applies to real-world robotics datasets.

Concern 7: results with other offline RL backbones

Interesting question! In our work, we use IQL as the backbone because (a) other baseline methods like IGDF and DARA use IQL as the base algorithm, so we also use IQL to ensure a fair comparison; (b) IQL naturally satisfies the in-distribution constraint in that it does not query any OOD samples, which is a nice property for guaranteeing that the learned policy lies in the support regions of the source domain dataset and the target domain dataset (such that term (a) and term (b) in Theorem 3.1 can be controlled); (c) no per-dataset hyperparameter tuning is required for IQL; (d) IQL consumes less training time than other offline RL methods like CQL.

Nevertheless, we agree that it would be interesting to investigate how OTDF behaves with another base offline RL algorithm. To that end, we first utilize TD3BC [11] as the base algorithm for OTDF and conduct experiments on some selected tasks. We also utilize TD3BC as the base algorithm for DARA and IGDF. We do not use BCQ since it is also an in-sample learning approach akin to IQL, and it runs much slower than TD3BC. We cannot include all possible experiments and can only report part of them here (kinematic shift tasks and morphology shift tasks). Note that we only replace the base algorithm and do not tune any hyperparameters. We run all methods for 5 independent runs and summarize the average normalized score results in the target domain below. The results in Table 1 and Table 2 clearly show that OTDF enjoys strong performance with TD3BC as the base algorithm (competitive with OTDF with IQL as the base algorithm), and surpasses baselines like DARA and IGDF significantly on numerous tasks.
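
For reference, the standard TD3+BC actor objective underlying these runs is shown below; the final $\beta$-weighted term is only our schematic reading of how OTDF's dataset constraint (the CVAE log-probability) would be appended, not a formula quoted from the paper:

$$
\pi \leftarrow \arg\max_{\pi}\;
\mathbb{E}_{(s,a)\sim \mathcal{D}}\Big[\lambda\, Q\big(s,\pi(s)\big) - \big(\pi(s)-a\big)^2 + \beta \log p_{\mathrm{CVAE}}\big(\pi(s)\mid s\big)\Big]
$$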

Comment

| Source | Target | OTDF (IQL) | OTDF (TD3BC) | DARA (TD3BC) | IGDF (TD3BC) |
| --- | --- | --- | --- | --- | --- |
| half-m | medium | 40.2±0.0 | 40.7±1.3 | 30.7±9.8 | 29.7±7.8 |
| half-m | medium-expert | 10.1±4.0 | 13.0±4.8 | 16.2±4.9 | 5.2±1.8 |
| half-m-r | medium | 37.8±2.1 | 41.7±0.3 | 20.7±10.3 | 21.5±4.6 |
| half-m-r | medium-expert | 9.7±2.0 | 21.8±6.6 | 7.4±3.0 | 11.5±5.3 |
| half-m-e | medium | 30.7±9.6 | 40.3±2.0 | 27.8±3.2 | 28.2±6.8 |
| hopp-m | medium | 65.6±1.9 | 59.2±8.7 | 33.0±18.8 | 26.9±17.9 |
| hopp-m | medium-expert | 55.4±25.1 | 47.4±19.3 | 32.1±27.1 | 35.4±22.9 |
| hopp-m-r | medium | 35.5±12.2 | 64.5±2.8 | 44.0±13.1 | 46.2±7.7 |
| hopp-m-r | medium-expert | 47.5±14.6 | 54.7±23.4 | 19.1±7.9 | 22.0±18.4 |
| hopp-m-e | medium | 65.3±2.4 | 64.8±1.8 | 51.1±15.0 | 60.1±5.7 |
| walk-m | medium | 49.6±18.0 | 55.1±10.1 | 25.9±8.6 | 40.0±15.1 |
| walk-m | medium-expert | 43.5±16.4 | 24.8±6.1 | 26.4±4.8 | 14.0±13.0 |
| walk-m-r | medium | 49.7±9.7 | 36.4±16.3 | 18.0±4.2 | 21.3±7.2 |
| walk-m-r | medium-expert | 55.9±17.1 | 26.6±15.2 | 19.1±11.6 | 24.0±11.5 |
| walk-m-e | medium | 44.6±6.0 | 40.7±14.7 | 29.2±14.0 | 36.0±13.4 |
| ant-m | medium | 55.4±0.0 | 51.6±5.5 | 23.2±4.0 | 29.1±4.8 |
| ant-m | medium-expert | 60.7±3.6 | 49.8±7.4 | 30.7±8.1 | 26.7±6.9 |
| ant-m-r | medium | 52.8±4.4 | 51.4±3.3 | 24.4±5.8 | 26.0±5.0 |
| ant-m-r | medium-expert | 54.2±5.2 | 53.2±6.0 | 30.5±6.7 | 34.0±8.0 |
| ant-m-e | medium | 50.2±4.3 | 52.4±4.8 | 23.5±5.1 | 18.0±11.7 |
| Total | | 913.9 | 890.1 | 533.0 | 555.8 |

Table 1. Results on the kinematic tasks with TD3BC as the backbone for different cross-domain offline RL methods. half=halfcheetah, hopp=hopper, walk=walker2d, m=medium, r=replay, e=expert. We report the average normalized scores in the target domain in conjunction with the standard deviations.

| Source | Target | OTDF (IQL) | OTDF (TD3BC) | DARA (TD3BC) | IGDF (TD3BC) |
| --- | --- | --- | --- | --- | --- |
| half-m | medium | 39.1±2.3 | 40.3±1.8 | 40.0±0.9 | 40.1±0.5 |
| half-m | medium-expert | 35.6±0.7 | 35.5±1.0 | 24.4±8.2 | 28.5±1.2 |
| half-m-r | medium | 40.0±1.2 | 42.4±0.5 | 27.7±6.3 | 29.6±7.0 |
| half-m-r | medium-expert | 34.4±0.7 | 34.4±4.1 | 16.0±4.0 | 14.9±5.2 |
| half-m-e | medium | 41.4±0.3 | 39.9±2.5 | 38.0±2.4 | 38.4±1.7 |
| hopp-m | medium | 11.0±0.9 | 27.3±5.4 | 13.2±0.3 | 13.2±0.2 |
| hopp-m | medium-expert | 12.6±0.8 | 20.6±14.3 | 12.3±1.4 | 11.3±2.6 |
| hopp-m-r | medium | 8.7±2.8 | 14.8±4.8 | 9.3±1.9 | 9.5±2.1 |
| hopp-m-r | medium-expert | 9.7±2.7 | 18.2±12.2 | 10.6±1.0 | 10.3±0.6 |
| hopp-m-e | medium | 7.9±3.2 | 27.5±12.9 | 15.6±3.5 | 13.0±1.7 |
| walk-m | medium | 50.5±5.8 | 3.1±0.1 | 6.9±2.4 | 16.5±17.9 |
| walk-m | medium-expert | 44.3±23.8 | 52.2±21.9 | 9.4±3.3 | 12.5±1.6 |
| walk-m-r | medium | 37.4±5.1 | 24.0±21.1 | 21.7±12.0 | 19.6±10.8 |
| walk-m-r | medium-expert | 33.8±6.9 | 33.4±8.1 | 1.3±0.7 | 5.8±2.5 |
| walk-m-e | medium | 49.9±4.6 | 9.7±13.2 | 8.0±2.0 | 6.9±1.5 |
| ant-m | medium | 39.4±1.7 | 42.8±0.3 | 32.5±3.4 | 20.2±18.3 |
| ant-m | medium-expert | 58.3±8.9 | 55.2±8.9 | 33.6±4.3 | 27.0±7.7 |
| ant-m-r | medium | 41.2±0.9 | 41.9±0.9 | 34.4±2.4 | 34.9±1.4 |
| ant-m-r | medium-expert | 50.8±4.5 | 56.6±3.8 | 31.3±2.1 | 33.0±4.9 |
| ant-m-e | medium | 39.9±2.9 | 42.0±1.1 | 33.1±6.9 | 29.1±6.8 |
| Total | | 685.9 | 661.8 | 419.3 | 414.3 |

Table 2. Results on the morphology tasks with TD3BC as the backbone for different cross-domain offline RL methods. We report the average normalized scores in the target domain in conjunction with the standard deviations.

Furthermore, we adopt CQL [12] as the backbone algorithm for OTDF. Since CQL is very slow, we only combine CQL with OTDF and summarize the results on some kinematic and morphology tasks in Table 3 and Table 4. We find that the performance of OTDF is comparatively inferior when adopting CQL as the backbone. The reason may be that we adopt the default hyperparameter setup for CQL (e.g., the penalty coefficient $\alpha$ is set to 10.0 for all tasks, which can be large). It turns out that the base algorithm can have a large impact on the performance of OTDF, and we recommend using IQL by default.

Comment

| Source | Target | OTDF (IQL) | OTDF (CQL) |
| --- | --- | --- | --- |
| half-m | medium | 40.2±0.0 | 22.1±12.2 |
| half-m | medium-expert | 10.1±4.0 | 12.1±12.6 |
| half-m-r | medium | 37.8±2.1 | 26.6±11.8 |
| half-m-r | medium-expert | 9.7±2.0 | 29.8±5.0 |
| hopp-m | medium | 65.6±1.9 | 65.4±0.7 |
| hopp-m | medium-expert | 55.4±25.1 | 25.0±18.1 |
| hopp-m-r | medium | 35.5±12.2 | 37.5±13.9 |
| hopp-m-r | medium-expert | 47.5±14.6 | 39.0±15.5 |
| walk-m | medium | 49.6±18.0 | 32.7±8.9 |
| walk-m | medium-expert | 43.5±16.4 | 38.5±17.4 |
| walk-m-r | medium | 49.7±9.7 | 25.5±9.3 |
| walk-m-r | medium-expert | 55.9±17.1 | 27.9±10.0 |
| ant-m | medium | 55.4±0.0 | 49.3±6.0 |
| ant-m | medium-expert | 60.7±3.6 | 56.7±5.1 |
| ant-m-r | medium | 52.8±4.4 | 53.1±5.4 |
| ant-m-r | medium-expert | 54.2±5.2 | 50.7±4.7 |
| Total | | 723.6 | 591.9 |

Table 3. Results on the kinematic tasks with CQL as the backbone algorithm. We report the average normalized scores in the target domain, and ± captures the standard deviations.

| Source | Target | OTDF (IQL) | OTDF (CQL) |
| --- | --- | --- | --- |
| half-m | medium | 39.1±2.3 | 40.2±0.9 |
| half-m | medium-expert | 35.6±0.7 | 34.1±1.8 |
| half-m-r | medium | 40.0±1.2 | 41.5±0.3 |
| half-m-r | medium-expert | 34.4±0.7 | 29.7±5.7 |
| hopp-m | medium | 11.0±0.9 | 13.4±0.5 |
| hopp-m | medium-expert | 12.6±0.8 | 13.7±0.5 |
| hopp-m-r | medium | 8.7±2.8 | 4.3±3.0 |
| hopp-m-r | medium-expert | 9.7±2.7 | 12.1±4.6 |
| walk-m | medium | 50.5±5.8 | 36.6±10.5 |
| walk-m | medium-expert | 44.3±23.8 | 24.1±3.7 |
| walk-m-r | medium | 37.4±5.1 | 35.4±7.9 |
| walk-m-r | medium-expert | 33.8±6.9 | 12.4±5.5 |
| ant-m | medium | 39.4±1.7 | 39.8±2.6 |
| ant-m | medium-expert | 58.3±8.9 | 53.0±3.0 |
| ant-m-r | medium | 41.2±0.9 | 40.7±1.9 |
| ant-m-r | medium-expert | 50.8±4.5 | 50.7±1.7 |
| Total | | 546.8 | 481.7 |

Table 4. Results on the morphology tasks with CQL as the backbone algorithm. We report the average normalized scores in the target domain, and ± captures the standard deviations.

Hopefully, these can resolve the concerns. If there is still something unclear, please let us know!

[1] DARA: Dynamics-aware reward augmentation in offline reinforcement learning

[2] Contrastive representation for data filtering in cross-domain offline reinforcement learning

[3] Beyond OOD state actions: Supported cross-domain offline reinforcement learning

[4] Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism

[5] Off-policy deep reinforcement learning without exploration

[6] Supported policy optimization for offline reinforcement learning

[7] PLAS: Latent action space for offline reinforcement learning

[8] When to trust your simulator: Dynamics-aware hybrid offline-and-online reinforcement learning

[9] Off-Dynamics Reinforcement Learning via Domain Adaptation and Reward Augmented Imitation

[10] Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning

[11] A minimalist approach to offline reinforcement learning

[12] Conservative Q-learning for offline reinforcement learning

Comment

Dear Reviewer i1Jo, thanks for your thoughtful review. As the author-reviewer discussion period is near its end, we wonder if our rebuttal addresses your concerns. Please let us know if there is anything unclear!

Comment

Dear Reviewer i1Jo, we deeply appreciate your thoughtful review and your time, and hope that our response can address your concerns. We would like to kindly confirm if you still have any concerns or questions. We are more than happy to have further discussions with the reviewer if possible!

Official Review
Rating: 6

This paper addresses the challenge of cross-domain offline reinforcement learning (RL) where the goal is to leverage a source domain dataset to improve policy learning in a target domain with limited data. This paper identifies that directly merging data from different domains can lead to performance degradation due to dynamics mismatches. To mitigate this, they propose a method called Optimal Transport Data Filtering (OTDF), which aligns transitions from source and target domains using optimal transport, selectively shares source domain samples, and introduces a dataset regularization term to keep the learned policy within the scope of the target domain dataset. The effectiveness of OTDF is evaluated across various dynamics shift conditions with limited target domain data, demonstrating superior performance over strong baselines.

Strengths

  1. This paper introduces a novel approach to cross-domain offline RL by combining optimal transport with dataset regularization, which is a unique combination not commonly seen in the literature.
  2. The authors provide a solid theoretical foundation for their method, characterizing the performance bound of a policy and motivating their approach with theoretical insights.

Weaknesses

  1. In previous studies, DARA, BOSA, and IGDF all utilize source and target domain data of similar quality, implying that the state space distributions of the datasets were closely aligned. This alignment ensures, to some extent, that there is a certain overlap between the source and target domain data. Your setting differs from theirs, which naturally raises the question of whether data filtering or policy regularization played the critical role. As the results in Figure 3 indicate, the introduction of policy regularization has a highly significant impact on the performance of the algorithm. However, in previous methods such as SRPO and IGDF, constraints related to this were not designed, which may account for the significant performance improvement observed in OTDF.

  2. In the paper "Cross-domain imitation learning via optimal transport, ICLR 2022," some methods for cross-domain transfer based on optimal transport have already been adopted. What do you think are the differences between your method and the previous methods in terms of implementation?

Questions

  1. Table 1 is too wide and not visually pleasing. I recommend using \resizebox{\linewidth}{!} to fix it.
  2. Why is the performance of IQL not assessed when using only source domain data?
  3. The paper does not provide the impact of using OT for data filtering or the subsequent policy regularization on the performance bound of OTDF. A sub-optimality gap is expected to be given.

Comment

We thank the reviewer for the constructive comments. We provide point-to-point clarification to the concerns of the reviewer. If we are able to resolve some concerns, we hope that the reviewer will be willing to raise the score.

Concern 1: on the experiment setting

Thanks for the question. We use source and target domain data of distinct qualities because (a) we can better capture the generality and effectiveness of the cross-domain offline RL methods and there is no definite rule that one can only conduct cross-domain offline policy adaptation for source domain offline datasets and target domain datasets with the same quality level; (b) in many real-world applications, we often cannot provide exactly the same performance level offline dataset for both domains.

We agree that there is a certain overlap between the source and target domain data to some extent when they are of similar quality (it is hard to quantify that though). Nevertheless, we respectfully argue that they can also have some overlap even when they are of different qualities since they are still striving towards the same goal. For example, the robot aims at walking forward in the walker2d task. The expert-level walker2d dataset with kinematic shift can also share some overlap with the medium-level source domain dataset (e.g., some walking movements and patterns). Furthermore, when one adopts a medium-level target domain dataset, it can at least share overlap with medium/medium-replay/medium-expert source domain datasets since they both contain medium-level data. It also applies to target domain datasets with medium-expert quality and expert quality. Hence, we believe that it is reasonable to adopt a setting like this. Moreover, OTDF still significantly outperforms baseline methods like IGDF and SRPO across various kinds of dynamics shifts when the source domain dataset and target domain dataset have the same performance level (e.g., medium-level source domain data and medium-level target domain data), as depicted in Table 1, Table 2 and Table 6.

We further clarify that both data filtering based on optimal transport and policy regularization indeed play critical roles. We reiterate that our method is fully motivated by the theoretical results in Theorem 3.1 and by the experimental setting where we only have a limited budget of target domain data. Data filtering effectively controls the dynamics mismatch term (c) in Theorem 3.1 by keeping source domain data that have similar transition dynamics to the target domain dynamics (as discussed in Lines 160-165). Optimal transport is a principled approach for comparing two distributions and is ideal for reliable data filtering due to its training-free nature, given that we only have limited target domain data. Furthermore, the dataset constraint objective better controls the target policy deviation term (b) in Theorem 3.1 and ensures that the learned policy does not get biased towards the source domain dataset distribution, considering that the target domain dataset size is limited. The parameter studies in Figure 2 and Figure 3 show that excluding either data filtering ($\xi=100$) or the dataset constraint ($\beta=0$) often incurs a performance drop, indicating the necessity and effectiveness of both components of OTDF. The influence of data filtering can also be large for many tasks. We agree that the introduction of the dataset constraint term is important to OTDF and that previous methods like SRPO and IGDF do not include it, but this component is still our novel contribution.
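
To make the interplay concrete, below is a minimal sketch (our illustration with hypothetical names, not the authors' code) of the selection-and-weighting step, assuming a per-transition distance to the target dataset has already been computed from the optimal transport coupling described in the paper; the exponential weighting form is an illustrative assumption.

```python
# Sketch of threshold-based selection and weighting of source transitions.
# `dist[i]` is assumed to be a pre-computed OT-based distance of source transition i
# to the target-domain dataset; `xi` is the filtering threshold.
import numpy as np

def filter_source_data(source_transitions, dist, xi=1.0, temperature=1.0):
    keep = dist <= xi                               # drop transitions far from the target dynamics
    weights = np.exp(-dist[keep] / temperature)     # down-weight less-aligned transitions
    weights = weights / (weights.mean() + 1e-8)     # keep the overall loss scale roughly unchanged
    selected = [source_transitions[i] for i in np.where(keep)[0]]
    return selected, weights
```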

Concern 2: differences between our work and some prior works

Thanks for recommending GWIL [1], which we actually cited in our original submission. Although both works utilize optimal transport in the cross-domain setting, our method differs from GWIL [1] in the following aspects:

  • the motivations are different. GWIL [1] solves the cross-domain imitation learning problem from a new perspective by directly formalizing it as an optimal transport problem. Our method, instead, utilizes optimal transport (OT) because (a) OT is a principled and widely used approach for comparing two distributions, and (b) we only have a very limited budget of target domain data, and OT can work in such a low-data regime thanks to its training-free nature.
  • the experimental settings are different. GWIL focuses on the cross-domain imitation learning scenario, where expert demonstrations are needed. Our OTDF addresses cross-domain offline RL problems with source and target domain offline datasets of different performance levels (no expert demonstration is required).
  • the core methods are different. GWIL leverages optimal transport for constructing pseudo-rewards, while OTDF utilizes optimal transport for data filtering (i.e., keeping source domain data that have similar transition dynamics to the target domain dynamics).

[1] Cross-domain imitation learning via optimal transport

Comment

Concern 3: Table 1 is too wide and not pleasing enough

Thanks for the comment. We have modified all tables in our manuscript to ensure that they look pleasing.

Concern 4: Why is the performance of IQL not assessed when using only source domain data?

We do not report the performance of IQL with only source domain data since we care more about the agent's performance in the target domain. Considering that there exists a dynamics gap between the source domain and the target domain, directly deploying IQL trained with only source domain data may incur inferior performance in the target domain. We hence choose to report IQL*, which trains IQL using both the source domain data and the target domain data. Since the reviewer asks, we conduct experiments with IQL (source only) on some tasks with kinematic shifts and morphology shifts. We report the average normalized scores across 5 seeds and summarize the results below. Note that the performance of IQL (source only) remains identical across different target domain dataset qualities since it is trained using only source domain data. We find that IQL with only source domain data often performs worse than OTDF and underperforms IQL*, indicating that it may not be a meaningful baseline.

| Source | Target | IQL (source only) | IQL* | OTDF |
| --- | --- | --- | --- | --- |
| half-m | medium | 14.7±2.0 | 12.3±1.2 | 40.2±0.0 |
| half-m | medium-expert | 14.7±2.0 | 10.8±1.9 | 10.1±4.0 |
| half-m-r | medium | 14.5±2.1 | 10.0±5.4 | 37.8±2.1 |
| half-m-r | medium-expert | 14.5±2.1 | 6.5±3.1 | 9.7±2.0 |
| half-m-e | medium | 13.1±3.4 | 21.8±6.5 | 30.7±9.6 |
| half-m-e | medium-expert | 13.1±3.4 | 7.6±1.4 | 10.9±4.2 |
| hopp-m | medium | 31.5±9.2 | 58.7±8.4 | 65.6±1.9 |
| hopp-m | medium-expert | 31.5±9.2 | 68.5±12.4 | 55.4±25.1 |
| hopp-m-r | medium | 21.0±15.7 | 36.0±0.1 | 35.5±12.2 |
| hopp-m-r | medium-expert | 21.0±15.7 | 36.1±0.1 | 47.5±14.6 |
| hopp-m-e | medium | 8.7±6.8 | 66.0±0.5 | 65.3±2.4 |
| hopp-m-e | medium-expert | 8.7±6.8 | 45.1±15.7 | 38.6±15.9 |
| walk-m | medium | 40.0±15.6 | 34.3±9.8 | 49.6±18.0 |
| walk-m | medium-expert | 40.0±15.6 | 30.2±12.5 | 43.5±16.4 |
| walk-m-r | medium | 12.5±5.1 | 11.5±7.1 | 49.7±9.7 |
| walk-m-r | medium-expert | 12.5±5.1 | 9.7±3.8 | 55.9±17.1 |
| walk-m-e | medium | 24.1±12.2 | 41.8±8.8 | 44.6±6.0 |
| walk-m-e | medium-expert | 24.1±12.2 | 22.2±8.7 | 16.5±7.2 |
| ant-m | medium | 19.5±4.2 | 50.0±5.6 | 55.4±0.0 |
| ant-m | medium-expert | 19.5±4.2 | 57.8±7.2 | 60.7±3.6 |
| ant-m-r | medium | 22.3±0.5 | 43.7±4.6 | 52.8±4.4 |
| ant-m-r | medium-expert | 22.3±0.5 | 36.5±5.9 | 54.2±5.2 |
| ant-m-e | medium | 16.4±2.5 | 49.5±4.1 | 50.2±4.3 |
| ant-m-e | medium-expert | 16.4±2.5 | 37.2±2.0 | 48.8±2.7 |
| Total | | 476.6 | 803.8 | 1029.2 |

Table 1. Results on kinematic tasks. IQL* denotes that IQL is trained with both the source domain dataset and target domain dataset. half=halfcheetah, hopp=hopper, walk=walker2d, m=medium, r=replay, e=expert. We report the mean normalized scores along with their standard deviations.

Comment

| Source | Target | IQL (source only) | IQL* | OTDF |
| --- | --- | --- | --- | --- |
| half-m | medium | 5.6±0.8 | 30.0±1.6 | 39.1±2.3 |
| half-m | medium-expert | 5.6±0.8 | 31.8±1.1 | 35.6±0.7 |
| half-m-r | medium | 6.5±0.4 | 30.8±4.4 | 40.0±1.2 |
| half-m-r | medium-expert | 6.5±0.4 | 12.9±2.2 | 34.4±0.7 |
| half-m-e | medium | 5.5±0.5 | 41.5±0.1 | 41.4±0.3 |
| half-m-e | medium-expert | 5.5±0.5 | 25.8±2.0 | 35.1±0.6 |
| hopp-m | medium | 13.2±0.1 | 13.5±0.2 | 11.0±0.9 |
| hopp-m | medium-expert | 13.2±0.1 | 13.4±0.1 | 12.6±0.8 |
| hopp-m-r | medium | 12.0±0.2 | 10.8±1.1 | 8.7±2.8 |
| hopp-m-r | medium-expert | 12.0±0.2 | 11.6±1.6 | 9.7±2.7 |
| hopp-m-e | medium | 10.3±2.8 | 12.6±1.4 | 7.9±3.2 |
| hopp-m-e | medium-expert | 10.3±2.8 | 14.1±1.3 | 9.6±3.5 |
| walk-m | medium | 13.2±1.7 | 23.0±4.7 | 50.5±5.8 |
| walk-m | medium-expert | 13.2±1.7 | 21.5±8.6 | 44.3±23.8 |
| walk-m-r | medium | 10.7±1.6 | 11.3±3.0 | 37.4±5.1 |
| walk-m-r | medium-expert | 10.7±1.6 | 7.0±1.5 | 33.8±6.9 |
| walk-m-e | medium | 13.6±6.7 | 24.1±7.4 | 49.9±4.6 |
| walk-m-e | medium-expert | 13.6±6.7 | 27.0±5.5 | 40.5±11.0 |
| ant-m | medium | 32.7±1.5 | 38.7±3.8 | 39.4±1.7 |
| ant-m | medium-expert | 32.7±1.5 | 47.0±5.1 | 58.3±8.9 |
| ant-m-r | medium | 30.3±2.8 | 38.2±2.9 | 41.2±0.9 |
| ant-m-r | medium-expert | 30.3±2.8 | 38.1±3.5 | 50.8±4.5 |
| ant-m-e | medium | 30.6±3.4 | 32.9±5.1 | 39.9±2.9 |
| ant-m-e | medium-expert | 30.6±3.4 | 35.7±3.9 | 65.7±4.5 |
| Total | | 368.4 | 593.3 | 836.8 |

Table 2. Results on morphology tasks. We report the mean normalized scores along with their standard deviations.

Concern 5: the impact of using OT for data filtering or subsequent policy regularization on the performance bound of OTDF

We respectfully clarify that both data filtering with optimal transport and policy regularization are motivated by the theoretical results in Theorem 3.1, as clarified in our response to Concern 1 above. Their impact on the performance bound can be seen qualitatively in Theorem 3.1: when using optimal transport for data filtering, we keep source domain data that are close to the target domain data, which means that we effectively modify the empirical distribution of the source domain dataset $\widehat{\mathcal{M}}_{\rm src}$ to be closer to the target domain, i.e., term (c) $D_{\rm TV}(P_{\mathcal{M}_{\rm tar}}\| P_{\widehat{\mathcal{M}}_{\rm src}})$ in Theorem 3.1 can be better controlled. Policy regularization helps control term (b) $D_{\rm TV}(\pi_{D_{\rm tar}}\|\pi)$. A rigorous and quantitative sub-optimality gap after introducing data filtering and policy regularization is difficult to derive (to the best of our knowledge, none of the previous cross-domain offline RL papers have successfully established such sub-optimality gaps for their methods) and may require strong assumptions, which would yield gaps between the theoretical analysis and practical application. We hence are sorry for not being able to provide the expected sub-optimality gap, but we would be eager to explore this in future work.
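
As a purely schematic summary of the bound structure described above (the constants $C_a, C_b, C_c$ and the exact form of term (a) are placeholders of ours; see Theorem 3.1 in the paper for the precise statement):

$$
\text{performance gap in } \mathcal{M}_{\rm tar}
\;\lesssim\;
\underbrace{C_a \cdot (\,\cdot\,)}_{\text{(a)}}
+ \underbrace{C_b\, D_{\rm TV}\big(\pi_{D_{\rm tar}} \,\|\, \pi\big)}_{\text{(b) target policy deviation}}
+ \underbrace{C_c\, D_{\rm TV}\big(P_{\mathcal{M}_{\rm tar}} \,\|\, P_{\widehat{\mathcal{M}}_{\rm src}}\big)}_{\text{(c) dynamics mismatch}}
$$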

Hopefully, these can resolve the concerns. If there is still something unclear, please let us know!

Comment

Dear Reviewer 6ito, thanks for your time and efforts in making our paper better. Since the deadline of Author-Reviewer discussion period draws near, we wonder if you can kindly check our rebuttal and see if our responses mitigate your concerns. We would appreciate it if you could give us some feedback, and we are ready to have further discussions with the reviewer if there is anything unclear.

Comment

Dear Reviewer 6ito, thank you for your helpful review! We would like to double-check if our response can address your concerns. Please do not hesitate to let us know if you still have any concerns or questions. We would appreciate it if the reviewer could re-evaluate our work based on the revised manuscript and the attached rebuttal. We are looking forward to your kind reply!

Comment

Thank you for your response! The concerns have essentially been addressed, and as a result, we have raised the score to 6.

Comment

We are glad that the reviewer's concerns are addressed. We thank the reviewer for raising the score to 6! Thanks for your time and efforts in making our paper better.

Comment

Dear reviewers,

Thank you for your time in reviewing our paper and for your valuable advice on making it better. We have uploaded a revision of our paper where we:

  • cited some related works on off-dynamics RL recommended by Reviewer i1Jo
  • fixed all the tables in our paper to ensure that they look pleasing and are not too wide, as recommended by Reviewer 6ito
  • clarified the notation $\mu_{t,t'}$, as raised by Reviewer vhAr

All modifications are highlighted in green. We welcome any suggestions from the reviewer and are pleased to have further discussions with the reviewers.

Best,

Submission 4658 authors

AC Meta-Review

This paper develops a procedure to sub-sample source data collected from many different tasks to build a reinforcement-learned (RL) policy on the target task using a small amount of offline data. The sub-sampling is conducted using optimal transport to calculate a propensity score for the behavior policy on the target data. The paper discusses results of this approach on standard offline RL problems using MuJoCo simulation environments.

This paper received two substantial reviews. The authors have done a remarkable job of addressing the reviewers' concerns in the rebuttal, through new experiments and detailed explanations. I recommend this paper for acceptance. I would suggest that the authors incorporate the discourse from the rebuttal phase into the main paper. I would also suggest that they be more careful when bolding numbers in Tables 1 and 2; it is reasonable to bold an entry only if it is statistically better than the other columns (e.g., with a p-value of 0.05). Since both the expert and random policies come with very high variance in these environments, the normalized score reported in these tables has a large variance; one could compute it more rigorously.

Additional Comments on Reviewer Discussion

Reviewer i1Jo was concerned about the computational overhead of building a conditional VAE to compute the propensity score of the behavior policy for the target data, ablation studies, and related work. The authors have addressed these comments elaborately in their rebuttal.

Reviewer 6ito was concerned as to why the authors use source data of a substantially different quality than the target data (e.g., source data from a random behavior policy and target data from an expert). The authors have addressed this concern satisfactorily. The reviewer also pointed to an older work that uses exactly optimal transport for cross-domain adaptation in the context of imitation learning; the present paper is (marginally, in my opinion) different because it works with offline reinforcement learning. Altogether, the reviewer was satisfied with this response and raised their score.

Reviewer vhAr provided a very superficial review with low confidence and a high score. I suggest ignoring this review.

It was difficult to get the reviewers to engage more with this paper despite some efforts. But the authors have provided a comprehensive rebuttal which helps understand the merits and deficiencies of this work. My recommendation is also based on my own reading of the paper.

Final Decision

Accept (Poster)