DITTO: Offline Imitation Learning with World Models
A novel imitation learning approach using RL with an intrinsic reward defined in the latent space of a learned world model.
Abstract
Reviews and Discussion
DITTO is a world-model-based approach for imitation learning. DITTO uses an intrinsic reward to learn to imitate expert trajectories. The intrinsic reward adopted is a distance between the states reached by the learned policy and the ones reached by the expert policy in the expert data, evaluated on the latent space trajectories of the world model. This strategy allows learning on-policy (in the world model's imagination).
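Schematically, the reward structure described in this summary can be written as follows (notation assumed for illustration: $z_t$ is the learner's imagined latent state, $z^{E}_t$ the expert's encoded latent at the same time-step, $d$ an abstract distance, and $\hat p$ the learned world model):

```latex
% Schematic only -- notation assumed, not taken from the paper.
r^{\mathrm{int}}_t \;=\; -\, d\!\big(z_t,\, z^{E}_t\big),
\qquad
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\pi,\,\hat{p}}\!\Big[\textstyle\sum_{t} \gamma^{t}\, r^{\mathrm{int}}_t\Big]
```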
Strengths
- Motivation: the approach is well-motivated, as covariate shift and learning from high-dimensional demonstrations are well-established problems. The authors also cite theoretical results from the literature, with appropriate references, to justify their approach.
Weaknesses
- Novelty: the novelty of the approach is limited, as it combines the idea of performing imitation learning through some form of state matching/inverse reinforcement learning, which is not novel [1], with the idea of using world models to learn behaviour in imagination for high-dimensional inputs, which has become increasingly popular in recent years.
- Evaluation: the results are limited to 6 environments (1 is in the Appendix). The authors compare baselines in terms of obtained rewards. First, this choice is questionable given that none of the approaches is optimizing to maximize rewards. Second, it is not clear why some approaches perform better than the Expert. In general, I would expect well-behaved approaches to obtain performance similar to the expert's, not to outperform it. Some examples of how to evaluate similarity to the expert are provided in [1].
[1] f-IRL: Inverse Reinforcement Learning via State Marginal Matching, Ni et al, 2020
Questions
- The authors use s to denote both the POMDP states and the world model latent states. These spaces can be completely different, both in dimensionality and in the information they contain, so denoting them with the same letter can be misleading. I recommend the authors update their notation.
- The presentation could be improved in several ways. There are some minor typos (see the Questions section of the review), Figure 1 is not very informative (it could show more clearly how the world model and the expert trajectories are employed), Figure 3 has an unusual layout with two plots being much larger than the others, and the related work contains equations that are not completely relevant (e.g. Equation 4).
- In order for the work to be accepted, I think the evaluation should be improved and better reflect the motivating scenario. I suggest the authors complete the story with more experiments and baselines. I also recommend the authors look into more adequate metrics of comparison that better support the motivation of this work, e.g. metrics showing that DITTO reduces covariate shift (also see comments in Weaknesses).
Some typos/writing suggestions:
- there is a repetition in the abstract: "in the latent space of a learned world model" appears twice in two consecutive sentences
- typo in the introduction: "the both the strong"
- in the Figure 1 caption: "the intrinsic reward 8"
[Update] I would like to thank the authors for replying to my review. Unfortunately, they did not address my concerns.
"The paper you point out, f-IRL, evaluates on 4 state-based environments, so we don't understand why 6 pixel-based environments is insufficient to substantiate our claims."
My comment was not explicitly aimed at comparing your work with f-IRL.
I think the evaluation of DITTO remains insufficient:
- the experiments show results that are difficult to interpret (e.g. why some baselines perform better than the Expert?)
- the number of environments and baselines adopted is limited. Since the authors seem to disagree with this claim, I refer them to two works which show more extensive evaluations [1,2], and many more can easily be found by searching online (especially in the offline RL literature)
- if the idea was to compare only against imitation learning approaches, different metrics to evaluate covariate shift should be adopted. I think the latent distance metric is useful for providing additional insights, but it cannot be used as a general tool to compare different approaches, as it is strongly dependent on the world model.
Given the above, my previous comments, the similar issues raised by the other reviewers, and the fact that the authors did not address most of the concerns raised, I feel more confident about rejecting the paper in its current state.
I think there are some interesting points in the paper, and so I strongly recommend the authors improve the structure and quality of the presentation, and improve their evaluation for future submissions, in order to make their work a more valuable contribution to the community.
[1] Offline Reinforcement Learning from Images with Latent Space Models, Rafailov et al
[2] Implicit Behavioral Cloning, Florence et al
It's true that our approach is related to the state divergence minimization literature, as we discuss in the paper. Our contribution is to provide a simple and performant instantiation of divergence minimization imitation learning, with the twist that we can do it through simple on-policy RL in a learned world model.
The paper you point out, f-IRL, evaluates on 4 state-based environments, so we don't understand why 6 pixel-based environments is insufficient to substantiate our claims.
RE: Appropriate metrics of covariate shift. This is why we included the latent distance plots, to point out how typical metrics such as action prediction accuracy fail to be predictive of the expected environment return, whereas our latent distance metric is predictive.
Thank you for pointing out the writing issues, we will address these.
This paper studies offline imitation learning from pixels, which is an interesting and appealing direction. The authors propose an algorithm dubbed DITTO, which optimizes a distance measure defined in the latent space of the learned world model. The authors propose to match the latent state of the learner to that of the expert: rather than minimizing over the entire expert dataset, they match the latent states of the learner and the expert at the same time-step. The procedure of DITTO is composed of two phases: (a) train the world model on the demonstration datasets; (b) encode expert demonstrations into the world model latent space, and use an on-policy reinforcement learning algorithm to optimize the intrinsic reward. They conduct several experiments on pixel-based tasks, including discrete control tasks like Atari and continuous control tasks like BipedalWalker.
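A minimal, self-contained sketch of the two-phase procedure summarized above; all names, the toy latent dynamics, and the specific distance are illustrative assumptions, not the authors' implementation:

```python
# Toy sketch of the two phases described above (illustrative assumptions throughout).
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, HORIZON = 16, 8, 10
W_enc = rng.standard_normal((OBS_DIM, LATENT_DIM))

def encode(obs):
    # stand-in for the world-model encoder learned in phase (a)
    return np.tanh(obs @ W_enc)

def imagine(policy, z0, horizon):
    # stand-in for rolling the policy out in the world model's latent space
    zs = [z0]
    for _ in range(horizon - 1):
        zs.append(zs[-1] + 0.1 * policy(zs[-1]))
    return np.stack(zs)

def intrinsic_reward(z_learner, z_expert):
    # negative latent distance at matched time-steps (the exact distance is assumed)
    return -np.linalg.norm(z_learner - z_expert, axis=-1)

# Phase (a): world-model training on the demonstration dataset (omitted here).
# Phase (b): encode expert demonstrations, roll the policy out in imagination
# from the expert's starting latent, and score it with the intrinsic reward
# that the on-policy RL algorithm would maximize.
expert_obs = rng.standard_normal((HORIZON, OBS_DIM))
z_expert = encode(expert_obs)
policy = lambda z: np.tanh(z)  # placeholder policy
z_learner = imagine(policy, z_expert[0], HORIZON)
print("imagined intrinsic return:", float(intrinsic_reward(z_learner, z_expert).sum()))
```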
Strengths
- the studied topic is interesting and important to the community
- the reproducibility of this paper is kind of good, and I believe the results provided in this paper are reproducible
- this paper is easy to follow and understand
Weaknesses
However, I also have some concerns about this paper. Please refer to the following comments.
- (major) This paper does not have a clear and reader-friendly structure. Many of the key formulas (e.g., the ELBO objective) are placed in the Appendix, while the authors spend considerable space on the related work section. The authors state that they theoretically show that their formulation induces a divergence bound between the expert and the learner. However, no formal theorem/lemma/proposition is presented in the paper. The core theoretical result in Eq. 6 is based on a previous conclusion. Also, what is the purpose and role of Appendix A? What conclusion can be derived from Corollary A.1?
- (major) The evaluations are limited. The authors only conduct experiments on 5 Atari tasks and 1 continuous control task. I do not think these are sufficient to show the generality and effectiveness of their method. More experiments in both the discrete and continuous control domains would definitely make this paper stronger.
- (major) Potential issues with the baselines. The authors only compare their proposed DITTO method against comparatively weak baselines such as BC, D-GAIL, etc. As far as the reviewer can tell, there are imitation learning methods that achieve quite good performance on the Atari games, e.g., IQ-Learn [1]. So how does DITTO compete against it? Another recent offline imitation learning algorithm, OTR [2], computes rewards using optimal transport, so how does DITTO compete against OTR? Intuitively, one could choose not to use the world model and instead use optimal transport to compute rewards for the downstream RL algorithm. It is unclear whether DITTO can beat OTR in this scenario.
[1] Iq-learn: Inverse soft-q learning for imitation. NeurIPS.
[2] Optimal transport for offline imitation learning. ICLR.
- (major) It is unclear how different choices of the base algorithm affect the learning efficiency and the final performance. The authors only adopt a simple actor-critic architecture, with the actor being updated using the REINFORCE algorithm and the critic being updated with the λ-return. A valuable and necessary study is how different base algorithms affect the performance of DITTO.
Based on the above considerations, I think this paper is below the acceptance bar of this venue.
- (minor) The authors state in Section 2.1 that "These methods are effective when low-dimensional proprioceptive states and actions are available but have not yet demonstrated strong performance in high-dimensional observation environments". I do not think it is difficult to extend these algorithms to pixel-based variants. Do you think these methods would perform poorly on pixel-based tasks if we introduced an encoder?
======================
Post Rebuttal Comments
Thanks for your rebuttal. It is a pity that you posted your rebuttal quite late and did not address my questions. I appreciate your efforts in this work and have the following suggestions that may help you improve your paper.
- Please re-organize your paper. As I commented, this paper does not have a reader-friendly structure in its current version. Some critical formulas ought to be placed in the main text; the authors could move part of the related work to the appendix.
- The authors only conducted one experiment in the continuous control domain. If you would like to claim that DITTO also works in the continuous control domain, please include more experiments on continuous control tasks. Otherwise, please focus only on the discrete control domain.
- If the corollary in the appendix is one of your contributions, it would be better to put it in the main text and describe how it contributes to your work and what insights we can learn from it.
- If possible, please include some stronger and more recent baselines.
Questions
- Can the authors also compare the computational cost of their method against the baselines?
- No limitations are stated in this paper. It is important to acknowledge both the advantages and the disadvantages of your method.
RE: "Many of the key formulas (e.g., ELBO objective) are placed in the Appendix, while the authors spend many spaces on related work part" We placed these details in the appendix because they are not novel, the world model is exactly that introduced in the Dreamerv2 paper.
RE: "However, no formal theorem/lemma/proposition is presented in the paper." This is incorrect. Appendix A gives a proof of the return bound.
RE: "The evaluations are limited. The authors only conduct experiments on 5 Atari tasks and 1 continuous control task." Frankly, most imitation learning papers report results on 3-4 low-dimensional state-based environments. We are surprised by this pushback, we feel our evaluation is comprehensive and in more difficult environments than many related works.
RE: IQ-learn and OTR. Thanks for pointing these out. It appears IQ-learn uses online interactions in the Atari environments? They are only fully offline in the state-based environments, if we've understood their paper correctly? It's difficult to directly compare since the experts we imitate have different performance profiles across the environments, but their results indeed look comparable, except for the online interaction part. The OTR paper is also interesting, but it appears they don't test in visual environments, only state-based. Indeed, the difficulty is getting a useful measure of state in visual domains to do some kind of divergence minimization or optimal transport approach. It would be interesting to run an OT approach on the world-model latent space! It seems people think visual imitation learning is trivial, but if you look at many of these papers, they do not test in pixel-based environments.
RE: Lambda return and inner-RL details: These are exactly the Dreamer policy learning procedure, there's nothing exotic or particular here. Indeed we imagine that using more modern approaches such as PPO for the inner RL would probably result in a stronger agent. But we are focusing here on the conceptual problem of doing imitation using world models, not the RL details in the inner loop. The fact that DITTO is this strong without an optimized inner RL loop is probably a good sign.
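For reference, the λ-return mentioned above, in the standard recursive form used by DreamerV2-style agents (notation follows DreamerV2, not necessarily this paper: $\hat z_t$ are imagined latents, $v_\xi$ the critic, $H$ the imagination horizon):

```latex
% \lambda-return used as the critic/actor target in DreamerV2-style policy learning
V^{\lambda}_{t} \;=\; r_{t} + \gamma_{t}
\begin{cases}
(1-\lambda)\, v_{\xi}(\hat{z}_{t+1}) + \lambda\, V^{\lambda}_{t+1} & \text{if } t < H,\\
v_{\xi}(\hat{z}_{H}) & \text{if } t = H.
\end{cases}
```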
RE: "I do not think it is difficult to extend these algorithms into the pixel-based variants, do you think these methods can have poor performance in the pixel-based tasks, given that we introduce an encoder to these algorithms?" Well, yes. If you look at SQIL and DRIL, you'll see that adapting standard algorithms like GAIL to the visual setting results in terrible performance. Pixel-based imitation learning is understudied. We hope our work helps to highlight this area of research.
This paper proposes an offline imitation learning method that imitates expert in the latent state space. The paper shows that the latent space of a learned world model can provide measure of state divergence, and measure imitation performance without access to an extrinsic reward function. The method first trains a world model, and then optimizes the RL agent by minimizing the latent state distance between the learned policy and the expert. The paper evaluates DITTO on challenging Atari environments from pixels, and shows that it outperforms existing offline imitation learning methods and achieves expert performance with minimal data. The paper also adapts two baseline methods, BC and GAIL, to the world model setting, and demonstrates that they benefit from the model-based approach but are still inferior to DITTO.
Strengths
The novel intrinsic reward based on the latent distance is interesting, and the paper shows that maximizing this reward induces imitation learning and bounds the extrinsic reward difference. I appreciate the authors' effort to justify the effectiveness of the heuristic reward in Appendix A. Besides, the paper provides a clear definition of the problem, and the motivation seems sound.
Weaknesses
- A significant issue I am concerned about is the contribution of the study. DITTO is built on Dreamer [1], which also trains the policy in the latent space; the difference is that Dreamer does not imitate the offline data. Besides, the proposed contributions in the Introduction section, i.e., D-BC and D-GAIL, are straightforward.
- The literature review is inadequate. For example, there is no discussion of works in the offline imitation learning domain, e.g., [2,3].
- The paper is not well-organized. Some important details are missing from the main body, e.g., the training objective of the world model.
- Some offline model-based RL baselines are missing [4]. Besides, some IL baselines that handle high-dimensional states should be considered, such as AIRL [5] and DICE [6].
References above
- [1] Dream to Control: Learning Behaviors by Latent Imagination. Hafner et al.
- [2] Offline imitation learning with a misspecified simulator. Jiang et al.
- [3] Curriculum Offline Imitation Learning. Liu et al.
- [4] Offline model-based adaptable policy learning. Chen et al.
- [5] Learning Robust Rewards with Adversarial Inverse Reinforcement Learning. Fu et al.
- [6] Imitation Learning via Off-Policy Distribution Matching. Kostrikov et al.
Questions
The paper does not provide any ablation studies or experiments to evaluate the impact of different components or design choices of DITTO. For example, how does changing the reward function, the distance measure, or the horizon length affect the results? How does DITTO compare with other model-based reinforcement learning methods that do not use imitation learning? How does DITTO perform on different types of environments, such as continuous control or navigation tasks?
I feel that the paper lacks analysis or discussion on the quality and robustness of the learned world model. How well does the world model capture the dynamics of the true environment? And how does DITTO handle situations where the world model fails to generate realistic or consistent latent trajectories?
Regarding the literature review being inadequate: the works you mention, [2] and [3], evaluate on 3 and 4 low-dimensional state-based environments, respectively. We evaluate on 6 pixel-based environments. There are indeed many related works in the imitation learning literature, but to maintain focus on the hard aspects of partial observability which we would like to tackle, we do not consider works which have so far only demonstrated success in state-based environments. Instead, we adapt mature methods like BC and GAIL to the world model setting. If you compare our adapted baselines to recent methods, you'll find they perform competitively.
RE: "The paper is not well-organized. Some important details are missed in the main body, e.g., the training objective of the world model." Since these details are exactly identical to prior works (adapted straight from Dreamer), we do not consider them significant contributions and placed them in the Appendix.
RE: "Some offline model-based RL baselines are missed" We are not doing offline RL. There are no rewards in our trajectories, so these methods simply do not apply.
The DICE [1] paper evaluates on 4 state-based environments. Again, we evaluate on 6 pixel-based environments, and do not compare with that method since it hasn't been demonstrated in this setting. AIRL is essentially a variant of GAIL, and evaluates again only on low-dimensional state-based environments.
RE: "How does DITTO compare with other model-based reinforcement learning methods that do not use imitation learning?" DITTO does not address the same problem as MBRL methods.
RE: "How does DITTO perform on different types of environments, such as continuous control or navigation tasks?" We included results on a continuous control task in the appendix - bipedal walked from pixels alone. Please take a look!
[1]: I. Kostrikov et al. Imitation Learning via Off-Policy Distribution Matching. In ICLR, 2020.
Thanks for your response.
Now I see that your setting focuses on offline IL without a reward function. However, as DITTO has a strong relation to Dreamer, I still hold the opinion that there should be a comparison. Besides, the baselines are too straightforward; maybe you can consider other proper, recent offline imitation learning baselines.
Overall, some of my concerns have been addressed. I still have reservations, as the experiments and paper organization can be further improved.
This paper introduces DITTO, a novel imitation learning algorithm that leverages a learned world model for offline training in a latent space. DITTO mitigates covariate shift and provides a measure of state divergence, enhancing the evaluation and optimization of imitation performance. Empirical results demonstrate that DITTO outperforms state-of-the-art methods in offline settings on challenging Atari environments from pixels, and the paper provides theoretical guarantees on the quality of imitation and generalization capability.
Strengths
- The paper leverages world model learning to address common problems in imitation learning, showing how the latent space of a learned world model provides a rich representation of environment dynamics for imitation learning.
- The paper introduces extensions of baseline imitation learning algorithms to the world model setting, demonstrating that DITTO achieves state-of-the-art performance on challenging Atari environments.
Weaknesses
- The novelty of the method seems limited, which looks like a simple combination of Dreamer and GAIL / BC.
- The experiment setup is not convincing, it is too simple and all results are only evaluated on a small set of simple tasks.
- There are some problems with the theoretical part.
Questions
- It is a little confusing that you prove a bound for the reward function given in Eq. (7) but actually use Eq. (8). The reason stated in the paper is that Eq. (7) is computationally expensive, but there is no clear relation between the formulations of Eq. (7) and Eq. (8). Could you please explain how you design the distance function in Eq. (8)?
- Following this, it will be more convincing if you train DITTO with Eq. (7) in a toy environment, which does not require much computing resources.
- In Fig. 2, the latent distance is defined by 1 − r_int, where r_int is also defined by yourself. It might be unfair to compare this indicator. What are the results of more common distance measures? Or could you give reasons why this measure is better than other measures?
- Is there any ablation study for different reward functions? If training with Eq. (8) is exactly better than with other common distance measures, it can also partly answer Question 3.
RE: "The novelty of the method seems limited, which looks like a simple combination of Dreamer and GAIL / BC." How so? GAIL uses a learned, adversarial setup for policy learning. BC simply clones actions. We discuss why both of these methods fail to produce robust imitation policies, and introduce a simple, novel intrinsic reward which can be optimized with off-the-shelf on-policy RL methods. We demonstrate that out method outperforms world-model augmented versions of GAIL and BC.
RE: "The experiment setup is not convincing, it is too simple and all results are only evaluated on a small set of simple tasks" These are standard environments for visual imitation learning. Please see e.g. [1][2], and compare their baseline results with ours. Most imitation learning papers evaluate on much simpler, state-based tasks.
RE: "there is no clear relation between the formulations of Eq. (7) and Eq. (8)" As we stated in the paper, the result holds for any distance function. We chose the particular form of equation 8 based on its empirical performance. The theoretical result does not depend on the details of the distance function. Please see the Appendix and the cited related work for details.
RE: "In Fig. 2, the latent distance is defined by 1−rint, where rint is also defined by yourself. It might be unfair to compare this indicator. What are the results of more common distance measures? Or could you give reasons why this measure is better than other measures?" The point here was to demonstrate that the latent distance measure we introduce is more predictive of final performance, compared to more common metrics of imitation performance, e.g. action prediction accuracy. We expect the on-policy latent distance to be more predictive of real world performance for reasons laid out in the paper, and we indeed find that to be the case.
RE: "Is there any ablation study for different reward functions? If training with Eq. (8) is exactly better than with other common distance measures, it can also partly answer Question 3." We would be happy to add ablations for other distance measures, e.g. L2, and so on. In early testing we found that this modified cosine-distance vastly outperformed all others. The focus of our work is not on the particular form of the distance function, but on the concept of minimizing on-policy state divergence using a learned model.
[1]: Kiante Brantley, Wen Sun, and Mikael Henaff. Disagreement-regularized imitation learning. In ICLR, 2020.
[2]: Siddharth Reddy, Anca D. Dragan, and Sergey Levine. Sqil: Imitation learning via reinforcement learning with sparse rewards. In ICLR, 2020.
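As a purely illustrative aid to the distance-measure ablation discussed in the response above, here is a toy harness comparing generic L2 and cosine distances between learner and expert latents; the paper's actual "modified cosine distance" (Eq. 8) is not reproduced here, and all data and names are assumptions:

```python
# Compare two candidate latent distance measures on toy data (illustrative only).
import numpy as np

def l2_distance(z_a, z_b):
    return np.linalg.norm(z_a - z_b, axis=-1)

def cosine_distance(z_a, z_b, eps=1e-8):
    num = (z_a * z_b).sum(axis=-1)
    den = np.linalg.norm(z_a, axis=-1) * np.linalg.norm(z_b, axis=-1) + eps
    return 1.0 - num / den

rng = np.random.default_rng(1)
z_expert = rng.standard_normal((10, 8))
z_learner = z_expert + 0.1 * rng.standard_normal((10, 8))  # slightly perturbed rollout

for name, dist in [("L2", l2_distance), ("cosine", cosine_distance)]:
    print(f"{name:>6} mean distance: {dist(z_learner, z_expert).mean():.4f}")
```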
Hi authors,
Thanks for your clarification.
- "The novelty of the method seems limited, which looks like a simple combination of Dreamer and GAIL / BC": I am now aware of my misunderstanding. Now I am raising my score to 5.
- "The experiment setup is not convincing": The environments involved are limited. I think the authors should try to combine more recent works rather than BC/GAIL, like [1-3], and should include more continuous control results.
- "In Fig. 2": Not that convincing, since your method is optimized exactly towards that distance.
The writing of this paper could be further improved.
[1] Haldar S., Mathur V., Yarats D., et al. Watch and Match: Supercharging Imitation with Regularized Optimal Transport. Conference on Robot Learning (CoRL), PMLR, 2023: 32-43.
[2] Cohen S., Amos B., Deisenroth M. P., et al. Imitation Learning from Pixel Observations for Continuous Control. Deep RL Workshop, NeurIPS 2021.
[3] Liu M., He T., Zhang W., et al. Visual Imitation Learning with Patch Rewards. ICLR 2023.
In this paper, an offline imitation learning method is proposed. The key idea lies in learning the environment model with expert demonstrations, and then doing reinforcement learning (RL) in the learned model. The proposed approach is tested on Atari games with pixel inputs, verifying its effectiveness in these environments.
Strengths
- The motivation of the paper is reasonable: reducing the cost of doing online RL by learning offline in the model.
- As far as I know, although model-based RL has a long history, the strategy of training the environment model first has not been used in imitation learning.
Weaknesses
- I wonder why the proposed method can handle the distribution shift issue in imitation learning. Please refer to Q1 below.
- The theoretical discussions in Sec. 3.1 are confusing to me. Please refer to Q2 below.
- The experimental results are not quite sufficient. There are several Atari games chosen in the experiments, but the reason for choosing them remains unclear. Why not report results on more games? Furthermore, it is difficult to understand what kind of internal feature space has been learned by the model. In my view, if learning the world model can address the distribution shift issue, one plausible reason is that the learned model feature space has strong generalization ability. This remains unclear to me due to the missing analysis of the learned feature space.
Questions
- As described in the paper, one of the central challenges for imitation learning is that the learner can face a significantly different data distribution from the expert's, since her policy can be very different from the one the expert uses. If we use only expert demonstrations to learn the environment model, why can't this model still be biased? If we don't allow the learner to interact with the environment, how can we build accurate models of situations that the expert seldom faces?
- I don't understand how Eq. 5 and Eq. 6 are related. Furthermore, the "adaptation error" term in Eq. 6 captures the error of model learning, which is crucial for understanding the proposed approach, but further analysis is lacking in the paper.
- We are not making any claims about the bias or generalization capabilities of the learned model itself. The point is that the return difference between expert and learner is bounded by the divergence of their induced distributions. Our method is about getting an estimate of the learner's distribution in the model, and explicitly penalizing the divergence from the expert over multiple time-steps, to address compounding error problems. We show that the world model latent space provides an excellent measure of state distribution divergence, and that minimizing this divergence using on-policy RL results in state-of-the-art performance in the offline setting (competitive with online methods).
- Equations 5 and 6 are related in the following way: equation 5 gives a bound on the return difference in terms of the state distribution divergence, and equation 6 shows how the true state distribution divergence is bounded by the divergence in the model. Therefore, bounding the divergence in the model bounds the true state divergence, thereby bounding the return difference.
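Schematically, the chain of bounds described above can be written as follows (the constant $C$, the divergence $D$, and the adaptation error term are left abstract here; this is an illustration of the argument, not the paper's exact statement):

```latex
% cf. Eq. (5): return gap bounded by the state-distribution divergence
\big| J(\pi^{E}) - J(\pi) \big| \;\le\; C \cdot D\!\big(d^{\pi^{E}},\, d^{\pi}\big)
% cf. Eq. (6): true divergence bounded by the in-model divergence plus an adaptation error
\;\le\; C \cdot \Big[ D_{\hat{p}}\!\big(d^{\pi^{E}},\, d^{\pi}\big) + \epsilon_{\mathrm{adapt}} \Big]
```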
RE: "There are several Atari games chosen in the experiments, but the reason for choosing them remain unclear." As we stated in the paper, these are the benchmarks used in [1], a well-known SOTA method in the community, as well as other top-performing imitation learning methods, e.g. [2]. In addition to those Atari experiments, we also gave results in a pixel-based continuous control environment. Please see the appendix for those results.
RE: "it is difficult to understand what kind of internal feature space that has been learned by the model [...] one plausible reason is that the learned model feature space has strong generalization ability." Indeed - our results support this view.
[1]: Kiante Brantley, Wen Sun, and Mikael Henaff. Disagreement-regularized imitation learning. In ICLR, 2020.
[2]: Siddharth Reddy, Anca D. Dragan, and Sergey Levine. Sqil: Imitation learning via reinforcement learning with sparse rewards. In ICLR, 2020.
This submission introduces an offline imitation learning algorithm, DITTO, consisting of a two-part training process: a world model trained on demonstrations of any quality and a policy using intrinsic rewards. The method is tested in an offline setting on Atari games, showing better performance than baselines, including Behavior Cloning and Generative Adversarial Imitation Learning.
Strengths
- The empirical results of DITTO are generally good.
- The method addresses an important problem in imitation learning and proposes an interesting solution by using a world model and intrinsic rewards.
- The submission is mostly well-written, clear, and easy to follow.
Weaknesses
- More experiments are needed to establish the superiority of DITTO.
- The discussions and comparisons on the related works are not sufficient, especially for offline imitation learning and imitation learning from pixels.
  - Offline imitation learning:
    - Offline imitation learning with a misspecified simulator. NeurIPS 2020
    - SMODICE: Versatile Offline Imitation Learning via State Occupancy Matching. ICML 2022
    - Discriminator-weighted offline imitation learning from suboptimal demonstrations. ICML 2022
  - Imitation learning from pixels:
    - Domain-Robust Visual Imitation Learning with Mutual Information Constraints. ICLR 2021
    - Imitation Learning from Pixel-Level Demonstrations by HashReward. AAMAS 2021
    - Visual Imitation Learning with Patch Rewards. ICLR 2023
- The submission does not support the claim that it addresses the problem of covariate shift with sufficient evidence.
- The impact of the quality of the world model on final imitation learning performance needs further analysis.
Questions
Referring to the weakness:
- How is the covariate shift problem embodied in the offline imitation learning problem? More empirical studies may be needed to verify the conclusion of the submission.
- In Figure 3, why does DITTO exceed average expert performance in 3/5 environments, and is this desirable in offline imitation learning?
- Comparison with existing model-based offline RL algorithms using similar datasets would be informative.
- Equation 5 establishes the relationship between the expert and learner return gap, and bounds it by their state distribution divergence. Our method bounds this state distribution divergence in the latent space of a learned world model, and we give a proof that this bounds the return difference. The learner is penalized by a novel intrinsic reward which we introduce to minimize covariate shift during training episodes; this is the central idea of the algorithm.
- We give theoretical results for the exact distribution-matching form of the DITTO algorithm, but for computational cost reasons approximate it to match the nearest latent from the initial expert trajectory, rather than from any expert trajectory (a toy contrast of the two matching rules is sketched after the reference below). This biases the algorithm towards the mode of the distribution. We speculate that this is beneficial in certain environments (i.e., that the mode of expert behavior outperforms the full expert behavior), as discussed in [1].
- We are not doing RL, so comparing with MBRL algorithms wouldn't really make sense; we are merely using the machinery of RL in the inner loop to induce imitation learning. The point is to use an online/on-policy algorithm inside the world model. Offline MBRL methods are tackling a different set of issues and are not relevant here. There is no "offline reward" here: there are no rewards in our expert trajectories.
[1]: F. Sasaki and R. Yamashina. Behavioral cloning from noisy demonstrations. In ICLR, 2021.
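The toy contrast referenced in the second point above: nearest-latent matching against all expert trajectories (the exact, expensive form) versus matching only against the expert trajectory the imagined rollout was initialized from (the approximation). Shapes, data, and names are illustrative assumptions:

```python
# Contrast exact vs. approximate nearest-latent matching on toy data.
import numpy as np

rng = np.random.default_rng(2)
N_TRAJ, HORIZON, LATENT_DIM = 5, 10, 8
expert_latents = rng.standard_normal((N_TRAJ, HORIZON, LATENT_DIM))
init_traj = 0  # the expert trajectory the imagined rollout starts from
learner_latents = expert_latents[init_traj] + 0.2 * rng.standard_normal((HORIZON, LATENT_DIM))

def nearest_distance(z, candidates):
    # distance from one learner latent to its nearest candidate expert latent
    return np.linalg.norm(candidates - z, axis=-1).min()

exact = [nearest_distance(z, expert_latents.reshape(-1, LATENT_DIM)) for z in learner_latents]
approx = [nearest_distance(z, expert_latents[init_traj]) for z in learner_latents]
print("nearest over all expert trajectories:", float(np.mean(exact)))
print("nearest over the initial trajectory :", float(np.mean(approx)))
```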
Hi authors, hi reviewers!
Reviewers gyqq, dH9k: do you find the author response convincing?
Authors: would it be possible for you to provide a rebuttal to the remaining reviews?
Done!
This paper attempts to address the offline imitation learning problem, where we have (i) an expert dataset which tells the agent what the goal is and (ii) a (typically much larger) exploration dataset which tells the agent how the world works. The proposed method works by first making a world model (where the exploration dataset is helpful) and then learning a policy against this world model using a reward based on distance from the expert dataset (in the latent space). The whole setup is compatible with pixel observations.
Strengths:
- Batch IL with Pixel-based states is a hard, but important problem
- proposed approach is conceptually simple, but powerful
- Interesting corollary in appendix
- AC thinks experiments are enough (while reviewers disagree)
Weaknesses:
- clarity could be improved (by a lot)
- theory of the paper needs to be rewritten and ideally presented differently. Ideally, you want to show a divergence bound between expert and imitator. The ingredients of the theory are there. To achieve that, I think you may want to combine the results of Rafailov et al. (2021), which say that policies trained against an accurate-enough model will give you good returns in the real environment, with the results of Ciosek (2022), which say that training with the indicator reward minimises divergence between expert and imitator online. You already cite both papers. Ideally, in a resubmission, you should combine these results into an explicit proposition. You may also need to argue that having a lot of exploration data leads to an accurate model (which does not seem too hard).
Overall, while the paper is currently below the bar, I do encourage the authors to rewrite and resubmit at a later date.
Why Not A Higher Score
See meta-review and additional comments.
Why Not A Lower Score
N/A
Reject