PaperHub
ICLR 2024 · Decision: Rejected
Ratings: 5, 8, 3 from 3 reviewers — average 5.3/10 (min 3, max 8, std. dev. 2.1) · Average confidence: 4.0

Adversarial Imitation Learning from Visual Observations using Latent Information

OpenReview | PDF
Submitted: 2023-09-22 · Updated: 2024-02-11
TL;DR

We consider the challenge of learning from expert videos. We provide theoretical analysis and algorithmic solutions to this problem.

Abstract

Keywords
Adversarial Imitation Learning, Learning from experts, Learning from pixels, Reinforcement Learning

Reviews and Discussion

Review (Rating: 5)

This paper proposes a visual imitation learning approach in which the agent learns from expert observations but does not have access to expert actions. To deal with the high-dimensional visual observations, the imitative rewards are defined in a latent space, and the latent state space is learned by minimizing a TV-divergence objective. The paper is theoretically justified, and the proposed approach is evaluated in the MuJoCo domain.

Strengths

This paper has a sound theoretical analysis.

Weaknesses

Comparing sample efficiency and convergent return, the proposed approach does not show a clear advantage over the baseline methods.

This paper misses important related work that defines imitative rewards with the Sinkhorn distance, going beyond the GAIL framework.

[1] Dadashi, R., Hussenot, L., Geist, M., et al. "Primal Wasserstein Imitation Learning." International Conference on Learning Representations (ICLR). 2021.

Questions

Does the latent representation require pretraining? Or is it learned end-to-end with the policy network and Q network?

Comment

General

Thank you for taking the time to review our paper. In the following, we have carefully answered all of your concerns. We humbly request that you please consider updating your review scores based on the clarifications we have provided. Once again, we want to express our appreciation for your input.

W1, W2

Comparing the sample efficiency and the convergent return, the proposed approach has not shown much strength superior to the baseline methods.

We believe that our experiments demonstrate the benefits of our proposed approach. Our algorithm achieves similar or improved asymptotic performance compared to the baselines, while demonstrating significant improvements in terms of computational efficiency. These benefits are highlighted in Table 2, Table 3, and Figure 4.

In the visual imitation from observations setting, the bottom row of Table 2 shows that our algorithm achieves 75% of expert performance significantly faster than PatchAIL [a]: 5 times faster in walker walk, hopper stand, and hopper hop, 4 times faster in walker stand and walker run, and 2 times faster in cheetah run.

In the visual imitation learning setting, Table 3 shows that our algorithm outperforms the baseline VMAIL [b] in terms of both final performance and computational efficiency.

Finally, Figure 4 shows how our algorithm can leverage expert videos to improve efficiency in RL from pixels.

[a] Liu, Minghuan, et al. "Visual Imitation Learning with Patch Rewards." The Eleventh International Conference on Learning Representations. 2023.

[b] Rafailov, Rafael, et al. "Visual adversarial imitation learning using variational models." Advances in Neural Information Processing Systems 34 (2021): 3016-3028.

This paper misses important related work that defines imitative rewards with the Sinkhorn distance, going beyond the GAIL framework.

[1] Dadashi, R., Hussenot, L., Geist, M., et al. "Primal Wasserstein Imitation Learning." International Conference on Learning Representations (ICLR). 2021.

Thank you for sharing this paper with us. We have added this to our Related Work section in the updated version of the paper.

Q1

Does the latent representation require pre-training? Or is it learned end-to-end with the policy network and Q network?

The latent representation is entirely learned end-to-end and no pre-training is involved at any stage.

Review (Rating: 8)

The paper considers the problem of imitation learning from visual demonstrations, where not only the actions are unobservable, but also---due to partial observability---the underlying state. An upper bound is presented that shows that the suboptimality of the learner can be bounded based on the TV distance between the respective distributions over transitions in a latent space that compresses a history of observations and is assumed to be a sufficient statistic of the complete history (including actions). Motivated by this bound, a method is presented that performs imitation learning by applying GAIfO (GAIL with a state-transition discriminator) using the latent representations instead of the states. The latent representations are learned during imitation learning by backpropagating the Q-function loss of DDPG through the encoder (the Q-function is expressed as Q(z(x_{t^-:t}), a), with observation history x_{t^-:t}). No other losses (e.g. policy or discriminator loss) are backpropagated through the encoder.
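To make the update structure described above concrete, here is a minimal PyTorch-style sketch (an illustration, not the authors' code): the encoder receives gradients only from the critic loss, while the discriminator operates on detached latent transitions and the actor on detached latents. Network sizes, the reward transform, and the use of plain linear layers in place of a convolutional encoder are assumptions; target networks and data augmentation are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM, ACT_DIM, STACK, OBS_DIM = 50, 6, 3, 84 * 84  # illustrative sizes

# A linear encoder stands in for the convolutional feature extractor.
encoder = nn.Sequential(nn.Linear(STACK * OBS_DIM, 256), nn.ReLU(),
                        nn.Linear(256, LATENT_DIM))            # z = phi(x_{t^-:t})
critic = nn.Sequential(nn.Linear(LATENT_DIM + ACT_DIM, 256), nn.ReLU(),
                       nn.Linear(256, 1))                      # Q(z, a)
actor = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                      nn.Linear(256, ACT_DIM), nn.Tanh())      # pi(z)
disc = nn.Sequential(nn.Linear(2 * LATENT_DIM, 256), nn.ReLU(),
                     nn.Linear(256, 1))                        # D(z, z')

enc_critic_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(critic.parameters()), lr=1e-4)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
disc_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)


def update(agent_obs, agent_next_obs, agent_act,
           expert_obs, expert_next_obs, gamma=0.99):
    z, z_next = encoder(agent_obs), encoder(agent_next_obs)
    with torch.no_grad():
        ze, ze_next = encoder(expert_obs), encoder(expert_next_obs)

    # 1) Discriminator on latent state transitions; latents are detached so the
    #    discriminator loss does not shape the encoder.
    d_agent = disc(torch.cat([z.detach(), z_next.detach()], -1))
    d_expert = disc(torch.cat([ze, ze_next], -1))
    disc_loss = (F.binary_cross_entropy_with_logits(d_expert, torch.ones_like(d_expert))
                 + F.binary_cross_entropy_with_logits(d_agent, torch.zeros_like(d_agent)))
    disc_opt.zero_grad(); disc_loss.backward(); disc_opt.step()

    # 2) Imitation reward from the discriminator (one common GAIL-style choice).
    with torch.no_grad():
        reward = -F.logsigmoid(-disc(torch.cat([z, z_next], -1)))

    # 3) The critic loss is the only loss backpropagated through the encoder
    #    (target networks omitted for brevity).
    with torch.no_grad():
        target_q = reward + gamma * critic(torch.cat([z_next, actor(z_next)], -1))
    q = critic(torch.cat([z, agent_act], -1))
    critic_loss = F.mse_loss(q, target_q)
    enc_critic_opt.zero_grad(); critic_loss.backward(); enc_critic_opt.step()

    # 4) Deterministic policy gradient on detached latents (DDPG-style).
    actor_loss = -critic(torch.cat([z.detach(), actor(z.detach())], -1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()


if __name__ == "__main__":
    B = 8
    update(torch.randn(B, STACK * OBS_DIM), torch.randn(B, STACK * OBS_DIM),
           torch.rand(B, ACT_DIM) * 2 - 1,
           torch.randn(B, STACK * OBS_DIM), torch.randn(B, STACK * OBS_DIM))
```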

This method is compared to the baseline PatchAIL in the "Visual Imitation from Observations" (V-IfO) setting and to LAIL in the visual imitation learning (VIL) setting, where expert actions are observed and their history is used for computing the embedding. In both settings, the proposed method LAIfO/LAIL compares favorably to the baseline methods in terms of stability, final performance, and training time. Furthermore, the paper investigates the RL-from-demonstration setting, where the discriminator reward is augmented with a known reward function to guide exploration using demonstrations for vision-based locomotion tasks, which significantly improves performance compared to methods that do not make any use of demonstrations.

Strengths

  1. Soundness

  • The overall approach of learning a latent representation and imitating the expert with respect to latent transitions is sound.

  • The derived Theorems seem to be correct.

  • The claims are substantiated, and the main weaknesses (e.g. that expert and learner act in the same POMDP) are clearly communicated.

  2. Relevance

  • Imitation learning from (actual) observations is an important problem. Although I agree that learning under dynamics mismatch is still a key limitation, I think that the considered problem setting is still a useful step towards this objective.
  3. Novelty

  • The proposed method seems to be novel.
  4. Presentation

  • The method was very well presented. The paper was a very easy read for me, which, however, is also partially due to the fact that the method is very straightforward.
  5. Related work

  • I'm not very familiar with the particular problem setting of imitation learning in POMDPs with unobserved actions, so I am not sure that no important baseline is missing. But the paper certainly does discuss several important relevant works. I am only aware of a recent work by Al-Hafez et al. (2023) that performs imitation learning for locomotion without action observations, but does not consider partial observability due to visual observations.

Al-Hafez, F., Tateo, D., Arenz, O., Zhao, G., & Peters, J. (2023). LS-IQ: Implicit reward regularization for inverse reinforcement learning. (ICLR).

Weaknesses

  1. Experiments

  • The results presented in Table 2 do not seem to be statistically significant. I think it is misleading to highlight the best final performance in bold despite overlapping confidence intervals.

  • The experiments in the imitation learning from demonstration setting are not fair, as none of the baselines makes use of the expert demonstrations. It would be better to compare to methods that focus on this problem setting.

  2. Originality

While I think that the method is novel, it is also very straightforward and simple. While I do believe that simple methods are good, I could not get many new insights from the paper (the theorems are also relatively straightforward variations of previous theorems that bound suboptimality based on TV distance in IL and RL).

Questions

How can adding a reward objective to the imitation learning objective be justified? Can the detrimental effects of one objective on the other be bounded in some way?

It is common to not backpropagate through the actor in representation learning, and I also think that for similar reasons it makes sense to not backpropagate through the discriminator in the adversarial IL setting. However, did you consider additional (or alternative) methods to learn better representations? For example, many representation learning methods use additional objectives, e.g. contrastive losses or maximizing predictive information, which can significantly improve the downstream performance, in particular in RL from images.

Comment

General

Thank you for reviewing our paper. We appreciate your questions and in response we have carefully answered each of them. Thank you once again for your input, which has contributed to improving our work.

W1: Experiments

The results presented in Table 2 do not seem to be statistically significant. I think it is misleading to highlight the best final performance in bold despite overlapping confidence intervals.

We understand the concern raised by the reviewer. We chose to highlight the best final performance in bold based on conventions established in previous literature (refer to [1,2] for instance).

[1] Yang, Chao, et al. "Imitation learning from observations by minimizing inverse dynamics disagreement." Advances in neural information processing systems 32 (2019).

[2] Liu, Minghuan, et al. "Visual Imitation Learning with Patch Rewards." The Eleventh International Conference on Learning Representations. 2023.

The experiments in the imitation learning from demonstration setting are not fair as none of the baseline makes use of the expert demonstrations.

The main goal of Figure 4 and of the paragraph entitled “Improving RL using expert videos” is to show how our algorithm can leverage videos of experts to improve the efficiency of standard RL from pixels (see question (4) at the beginning of the Experiments section). As a result, in our view it makes sense to propose the comparison as in Figure 4. We also want to emphasize that we only assume access to visual observations of the expert in our algorithm and not direct knowledge of expert states, actions, and rewards since such information cannot be easily obtained when using videos of the expert.

In all the remaining paragraphs within the Experiments section, our algorithm and the baselines are compared in the same setting, i.e., either visual imitation learning or visual imitation from observations.

W2: Originality

While I do believe that simple methods are good, I could not get many new insights from the paper.

Thank you for the opportunity to further discuss this point. We also agree that simple methods are good, and we believe that interesting insights come both from the Theoretical Analysis and the Experiments sections.

The Theoretical Analysis section sheds light on the assumptions needed between expert and agent in order to perform visual imitation from observations, and motivates the idea of latent variable inference from high dimensional observations. Note that this is in contrast with PatchAIL where imitation is directly performed on the pixel space.

The Experiments section shows the practical implications of performing imitation on a lower dimensional space rather than on the pixel space. It is also clear from the Experiments section that our straightforward approach to latent variable estimation leads to strong performance in practice, without requiring more complicated inference procedures.

Q1

How can adding a reward objective to the imitation learning objective be justified? Can the detrimental effects of one objective on the other be bounded in some way?

By assumption, we consider both the expert and the agent acting on the same POMDP. As a result, in this specific case, the reward objective represents the reward, possibly sparse, which is maximized by the expert. The imitation learning objective, instead, provides a dense reward to the learning agent. Adding the imitation learning objective to this sparse reward yields the improved efficiency illustrated in Figure 4.

We believe that considering the impact of suboptimal data and/or relaxing the aforementioned assumption represent interesting directions for future work.
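For illustration, a minimal sketch of how the possibly sparse environment reward can be combined with the dense discriminator-based imitation reward, as discussed above. This is our assumption about the combination, not the authors' exact formulation; the mixing weight `lam` is hypothetical.

```python
def augmented_reward(env_reward: float, imitation_reward: float, lam: float = 1.0) -> float:
    """Reward used by the off-policy agent when both a task reward and expert videos are available."""
    return env_reward + lam * imitation_reward

# Example: a sparse task reward of 0.0 still yields a dense learning signal
# through the imitation term.
print(augmented_reward(0.0, 0.73))  # -> 0.73
```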

Q2

Did you consider additional (or alternate) methods to learn better representations? For example, many representation learning methods use additional objectives, e.g. contrastive losses, maximizing predictive information, which can significantly improve the downstream performance, in particular in RL from images.

Thank you for asking. We have tested different methods for latent variable inference, including variational inference [3], as well as different auxiliary contrastive losses for the feature extractor ϕ [4, 5]. The simple and effective approach presented in this paper outperformed these more complicated methods in the described setting.

[3] Lee, Alex X., et al. "Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model." Advances in Neural Information Processing Systems 33 (2020): 741-752.

[4] Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International conference on machine learning. PMLR, 2020.

[5] Grill, Jean-Bastien, et al. "Bootstrap your own latent-a new approach to self-supervised learning." Advances in neural information processing systems 33 (2020): 21271-21284.

Comment

Thank you for your response. I still believe that the weaknesses I initially raised are valid, namely:

  1. The insights are limited
  2. The use of boldface in Table 2 is misleading.
Review (Rating: 3)

The paper tackles the problem of "Visual Imitation from Observations" (V-IfO), where the only learning source is a set of RGB observations of a task. The theoretical contribution is establishing a new upper bound on the learner's suboptimality based on its divergence from the expert's state transitions as encoded in some latent. Methodology-wise, the authors propose a new algorithm, "Latent Adversarial Imitation from Observations" (LAIfO), which combines existing methodology from inverse RL (IRL) with observation stacking and data augmentation from recent off-policy RL algorithms. Empirically, the authors show their algorithm trains in less wall-clock time while retaining the same performance as other recent state-of-the-art imitation algorithms on six DeepMind Control tasks. Moreover, they also show that incorporating demonstrations with off-policy learning and rewards can speed up existing off-policy RL algorithms on three of the more challenging DeepMind Control tasks.

Strengths

  • Overall, the paper is well-written. In particular, the authors make an appreciated effort to clearly define notation and assumptions before delving into the analysis.

  • The methodology is clear and simple and mostly reproducible.

  • I appreciate the purpose of the paper, visual imitation is a relevant problem.

Weaknesses

  1. The proposed algorithm combines the adversarial imitation loss with gradient penalties from DAC [1] with the off-policy algorithm, stacking, and data-augmentation strategy from DrQv2 [2]. While both of these papers are cited in the text, the way the methodology is introduced in Section 5 never makes these connections explicit. As a consequence, I feel the way the algorithm is presented can be quite misleading to an unfamiliar reader. Hence, I believe changing Section 5 to clarify which components come from DAC, which ones come from DrQv2, and that the novelty lies in combining them, would be extremely important before this work can be published.

  2. I found the novelty of the theoretical analysis and methodology to be quite limited. While I believe this is not a mandatory aspect for a good paper, especially if the resulting algorithm is effective, I found the quality of the empirical evaluation insufficient to make such an assessment (see point 3).

  3. There are several aspects of the evaluation that left me unsatisfied with its quality. First, the comparison with PatchAIL-W and VMAIL is only carried out on six tasks from three environments from the DeepMind Control (DMC) suite, while the comparison with DrQv2 and Dreamer-v2 is only carried out on three tasks from a single environment. I would have appreciated seeing a wider variety (e.g., including other complex environments from DMC such as quadruped/jaco arm, and from alternative benchmarks, e.g., car racing and claw rotate as considered in VMAIL). Furthermore, the current ablation seems very limited, as it could consider studying the effect on performance of many additional design choices (e.g. spectral norm vs. gradient penalty for Lipschitzness/number of stacked frames/type of data augmentation...). Additionally, I think that reporting results also for a simple behavior cloning baseline with the same data augmentation/architecture/optimization would help understand the contribution from the introduced IRL methodology. Most worryingly, however, when comparing LAIfO with Dreamer-v2 and DrQv2, the performance of the baselines is considerably lower than what is reported in prior work (e.g. see [2]). Even after 10x10^6 steps, the gains from incorporating expert demonstrations seem marginal at best (if any) when using the results from DrQv2. I would really appreciate it if the authors could clarify this inconsistency (also given that DrQv2 shares the data that produced their reported learning curves).

  4. Again, related to the evaluation section, I find some of the claims to be quite misleading. E.g., in connection to the humanoid results, the authors state "we solve these tasks by using only 10^7 interactions". However, the reported performance on 2/3 tasks (walk and run) is still extremely low, and I would refrain from referring to any of these tasks as solved. Furthermore, to make the comparison fairer, I would have also appreciated seeing results for DrQv2/Dreamer-v2 with the expert demonstrations added to their respective replay buffers.

Minor:

I believe the visual imitation problem setting described is a special, simpler case of the visual third-person/observational imitation learning setting tackled by prior methods [3 as cited, 4, 5]. Yet, in contrast to what is stated in the Related Work section ("All of the aforementioned works consider fully observable environments"), this line of work also deals with visual observations. Hence, I believe there should be a clearer explicit connection.

[1] Kostrikov, Ilya, et al. "Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning." arXiv preprint arXiv:1809.02925 (2018).

[2] Yarats, Denis, et al. "Mastering visual continuous control: Improved data-augmented reinforcement learning." arXiv preprint arXiv:2107.09645 (2021).

[3] Stadie, Bradly C., Pieter Abbeel, and Ilya Sutskever. "Third-person imitation learning." arXiv preprint arXiv:1703.01703 (2017).

[4] Okumura, Ryo, Masashi Okada, and Tadahiro Taniguchi. "Domain-adversarial and-conditional state space model for imitation learning." 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020.

[5] Cetin, Edoardo, and Oya Celiktutan. "Domain-robust visual imitation learning with mutual information constraints." arXiv preprint arXiv:2103.05079 (2021).

Questions

  • Where can I find detail regarding the expert data (is it taken from a standard benchmark? Was it collected with any particular protocol?) I cannot find this important information in the main text.

  • Can the authors provide learning curves for an increased number of steps for the Humanoid tasks, if available? (Or at least also show the DrQv2 learning curves for the full 3x10^7 steps, shared in their repository, to provide a reference for Figure 4.)

In conclusion, while I mostly appreciate the nature of the contribution, the direction, and the presentation of the paper, I believe there are some current major flaws that make it not, yet, ready for publication. For this reason, I am currently leaning towards rejection. However, I am willing to change my score, in case the authors manage to properly address my criticism and questions.

Comment

General

Thank you for reviewing our paper. We appreciate your questions, and we have carefully answered them in order to address your concerns. We kindly request that you please consider increasing your review scores in light of the responses we provided. Once more, we sincerely appreciate your contribution, which has helped to improve the quality of our work.

W1, W2

The proposed algorithm combines the adversarial imitation loss with gradient penalties from DAC [1] with the off-policy algorithm, stacking and data-augmentation strategy from DrQv2 [2]. The way the methodology is introduced in Section 5 never makes these connections explicit. I believe changing Section 5 to clarify which components come from DAC, which ones come from DrQv2 and that the novelty lies in combining them, would be extremely important before this work can be published.

We have updated the beginning of Section 5 to cite all the literature inspiring our algorithm, including DAC and DrQ v2, in order to improve clarity and transparency.

We want to emphasize that the combination of the different elements from DAC and DrQ v2 is well-motivated by the novel theoretical analysis in Section 4, which is instrumental for the formulation of our algorithm and of other possible variants.
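For readers unfamiliar with the two borrowed components mentioned in this exchange, here is a rough sketch (our illustration under stated assumptions, not the authors' implementation) of a DAC/WGAN-GP-style gradient penalty on the discriminator and a DrQ-v2-style pad-and-crop random-shift augmentation; the penalty weight and padding size are illustrative.

```python
import torch
import torch.nn.functional as F

def gradient_penalty(disc, expert_input, agent_input, weight=10.0):
    """Penalize the discriminator's gradient norm on interpolated inputs
    (DAC/WGAN-GP style); add the result to the discriminator loss."""
    alpha = torch.rand(expert_input.size(0), 1, device=expert_input.device)
    mixed = (alpha * expert_input + (1 - alpha) * agent_input).detach().requires_grad_(True)
    grads = torch.autograd.grad(disc(mixed).sum(), mixed, create_graph=True)[0]
    return weight * ((grads.norm(2, dim=-1) - 1) ** 2).mean()

def random_shift(images, pad=4):
    """Pad-and-crop random shift on a batch of images (DrQ-v2 style)."""
    n, _, h, w = images.shape
    padded = F.pad(images, (pad, pad, pad, pad), mode="replicate")
    tops = torch.randint(0, 2 * pad + 1, (n,))
    lefts = torch.randint(0, 2 * pad + 1, (n,))
    return torch.stack([padded[i, :, int(t):int(t) + h, int(l):int(l) + w]
                        for i, (t, l) in enumerate(zip(tops, lefts))])

if __name__ == "__main__":
    toy_disc = torch.nn.Linear(4, 1)
    print(gradient_penalty(toy_disc, torch.randn(8, 4), torch.randn(8, 4)))
    print(random_shift(torch.rand(2, 3, 84, 84)).shape)  # torch.Size([2, 3, 84, 84])
```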

I found the novelty of theoretical analysis and methodology to be quite limited. While I believe this is not a mandatory aspect for a good paper, especially if the resulting algorithm is effective, I found the quality of the empirical evaluation insufficient to make such an assessment (see point 3).

Thank you for the opportunity to further discuss this point. We believe that important insights come both from our theoretical analysis and our experiments.

To the best of our knowledge, theoretical analysis of the imitation from observations problem in a partially observable environment does not appear in previous work. The analysis in Section 4 sheds light on the assumptions needed between expert and agent in order to perform visual imitation from observations, and motivates the idea of latent variable inference from high dimensional observations. Note that this is in contrast with PatchAIL in [a] where imitation is directly performed on the pixel space.

The Experiments section shows the practical implications of performing imitation on a lower dimensional space rather than on the pixel space. This results in significant improvements in computational efficiency, as emphasized by the wall-clock time ratio in Table 2.

[a] Liu, Minghuan, et al. "Visual Imitation Learning with Patch Rewards." The Eleventh International Conference on Learning Representations. 2023.

W3, W4

The comparison with PatchAIL-W and VMAIL is only carried out on six tasks from three environments from the DeepMind Control (DMC) suite, while the comparison with DrQv2 and Dreamer-v2 is only carried out in three tasks from a single environment. I would have appreciated seeing a wider variety (e.g., including other complex environments from DMC such as quadruped/jaco arm and from alternative benchmarks e.g., car racing, claw rotate as considered in VMAIL).

The goal of our Experiments section is to answer questions (1)-(4) as introduced at the beginning of Section 6, not to be exhaustive. We agree with the reviewer that our paper could benefit from additional experiments, but we do not think that the main message of the section will change if more environments or tasks are added.

Moreover, in terms of quantity and difficulty of the tested environments, our paper is aligned with the convention established by the previous literature. Specifically, for the comparison with PatchAIL-W [a] and VMAIL [b], we test 3 easy tasks and 3 medium tasks from the DMC suite. In [a], 4 easy tasks and 2 medium tasks from the DMC suite are tested. In [b], 5 environments are tested.

For the comparison with DrQ v2 and Dreamer v2, we carry out experiments only in tasks which are considered hard to solve. In these types of tasks, leveraging expert videos makes more sense and the impact of our solution can be better appreciated as DrQ v2 and Dreamer v2 are quite inefficient.

[b] Rafailov, Rafael, et al. "Visual adversarial imitation learning using variational models." Advances in Neural Information Processing Systems 34 (2021): 3016-3028.

Comment

The current ablation seems very limited, as it could consider studying the effect on performance of many additional design choices (e.g. spectral norm vs. gradient penalty for Lipschitzness/number of stacked frames/type of data augmentation...).

The main goal of the current ablation study is to answer question (3) as described at the beginning of Section 6. We agree that many other ablation studies are possible in order to evaluate our design choices. However, we decided to focus on an ablation across the imitation-from-experts settings introduced in Table 1 rather than ablating over design choices. We considered this type of ablation more aligned with the concepts presented throughout the paper.

I think that reporting results also for a simple behavior cloning baseline with the same data-augmentation/architecture/optimization would help understand the contribution from the introduced IRL methodology.

Behavior cloning would only be possible in our Visual Imitation Learning experiments where expert observation (state)-action pairs are available. As demonstrated in previous literature (cf. [a]), we do not expect this approach to be competitive and this is why we did not include it in our experiments.

[a] Liu, Minghuan, et al. "Visual Imitation Learning with Patch Rewards." The Eleventh International Conference on Learning Representations. 2023.
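For context only, a minimal behavior-cloning sketch (hypothetical, PyTorch-style; not part of the paper) that makes explicit why such a baseline needs expert actions as regression targets, which are unavailable in the V-IfO setting.

```python
import torch
import torch.nn.functional as F

def bc_update(policy, optimizer, expert_obs, expert_act):
    """One supervised step: regress the policy output onto the expert action.
    Without expert_act (the V-IfO setting), this objective cannot be formed."""
    loss = F.mse_loss(policy(expert_obs), expert_act)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    policy = torch.nn.Linear(16, 4)
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    print(bc_update(policy, opt, torch.randn(32, 16), torch.randn(32, 4)))
```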

Most worryingly, however, when comparing LAIfO with Dreamer-v2 and DrQv2, the performance of the baselines is considerably lower than what is reported in prior work (e.g. see [2]). Even after 10x10^6 steps, the gains from incorporating expert demonstrations seem marginal at best (if any) when using the results from DrQv2. I would really appreciate it if the authors could clarify this inconsistency (also given that DrQv2 shares the data that produced their reported learning curves).

For a fair comparison, all the experiments have been conducted using the same code: https://anonymous.4open.science/r/AIL_from_visual_obs-2C3C/README.md

We also noticed the aforementioned inconsistency. For the sake of transparency, we have included in the updated Appendix both the results obtained in our experiments and the results obtained considering the DrQ v2 reported learning curves. In both cases, RL+LAIfO achieves significant improvements compared to DrQ v2 after 10^7 interactions.

The reason for this inconsistency is probably the inherent randomness in each run (even if the seed is fixed). This comes from the fact that the rendering is non-deterministic. We refer to the following issues for more on this:

Again, related to the evaluation Section, I find some of the claims to be quite misleading. E.g. in connection to the humanoid results the authors state "we solve these tasks by using only 10^7 interactions" However, the reported performance on 2/3 tasks (walk and run) is still extremely low, and I would refrain from referring to any of these tasks as solved.

Thank you for pointing this out. We have updated the wording to clarify this claim in the updated version of the paper.

Furthermore, I think to make the comparison fairer I would have also appreciated seeing results for DrQv2/Dreamerv2 adding the expert demonstrations to their respective replay buffers.

The main goal of these experiments is to show how our algorithm can leverage videos of experts to improve the efficiency of standard RL from pixels (see question (4) at the beginning of the Experiments section). This is the reasoning behind the proposed comparison in Figure 4.

We also want to emphasize that we only assume access to visual observations of the expert in our algorithm and not direct knowledge of expert states, actions, and rewards since such information cannot be easily obtained when using videos of the expert. This condition rules out several offline RL techniques which might be used in order to leverage expert data in DrQ v2 and Dreamer v2.

Minor W

I believe the visual imitation problem setting described is a special, simpler case of the visual third-person/observational imitation learning setting tackled by prior methods [3 as cited, 4, 5]. Yet, in contrast to what is stated in the Related Work section ("All of the aforementioned works consider fully observable environments"), this line of work also deals with visual observations. Hence, I believe there should be a clearer explicit connection.

Thank you for sharing these papers with us. We have added them to our Related Work section and clarified the connections with our work in the updated version of the paper.

Comment

Q1

Where can I find detail regarding the expert data (is it taken from a standard benchmark? Was it collected with any particular protocol?) I cannot find this important information in the main text.

The expert policies are obtained by training DDPG in a fully observable setting as described in the caption of Table 2. The full set of pre-trained policies can be downloaded from https://figshare.com/s/22de566de2229068fb75 . This information is available in the README of our code repository, and we have updated the caption of Table 2 to make this clear in the main text.

Can the authors provide learning curves for an increased number of steps for the Humanoid tasks, if available? (or at least also show the DrQv2 learning curves for the full 3x10^7 steps, shared in their repository to provide a reference for Figure 4)

At the moment, we can provide curves for all the experiments as shown in the paper. We are now running RL+LAIfO for the full 3x10^7 steps and we will also provide these curves as soon as they are ready.

Comment

Unfortunately, while I appreciate the effort to respond to raised concerns, the modifications currently made to the text are minimal and I believe all my criticism still stands.

In particular, while the novelty lies in combining the existing practices from prior work, the way the work is presented never makes this clear, even in the current version (I did not find the small changes made to page 5 very helpful in this regard).

Moreover, I still found the evaluation to be unsatisfactory for the reasons listed in my original review. In their response, the authors note that, for instance, VMAIL also tests on 5 environments. However, I would like to point out that these environments involve a much higher degree of diversity: they include continuous-control locomotion, car racing, and complex robotic manipulation tasks. Moreover, VMAIL's experimental section considers several ablations and auxiliary experiments to better understand their algorithm.

While contributions with relatively limited conceptual novelty and analysis can be impactful if they show consistent improvements, I believe they require a clear presentation and some breadth in their empirical evaluation for validation. Currently, I believe the paper is lacking in precisely these aspects, and I would encourage the authors to take the provided criticism into consideration in future revisions of their work.

Comment

We want to thank the AC for handling our paper and the reviewers for providing valuable comments. We have addressed the reviewers' comments point-by-point and uploaded an updated version of the paper. Once more, thank you for your feedback, which has helped to improve the quality of our work.

Comment

Dear reviewers,

The author-reviewer discussion period is coming to a close soon. We have tried to provide detailed responses in our rebuttals, which we hope address your questions and concerns. Please let us know if you have any further questions, and we will be happy to provide additional details prior to the end of the discussion period. If we have addressed your main concerns, we ask that you please consider updating your review scores based on the responses we have provided. Thank you again for taking the time to review our paper.

AC Meta-Review

This paper considers imitation learning from visual demonstrations. The problem is challenging because the actions and state are only partially observed. Using an upper bound on the learner's suboptimality, a GAIL-based learning approach is developed. Improved performance compared to existing methods is demonstrated on six DeepMind Control tasks.

Reviewer opinions of the novelty of the paper were mixed, but reviewers generally felt the insights from the work were limited. The experiments were not strongly convincing and, as pointed out, were narrower in scope than other related work. Either improving the insights of the paper or strengthening the experimental results seems a promising direction for convincing reviewers in a future revision. These issues currently prevent a recommendation of acceptance for the paper.

Why Not a Higher Score

Limited experimental evaluation compared to related work and limited insights conveyed in the paper.

Why Not a Lower Score

N/A

Final Decision

Reject