Reward-free World Models for Online Imitation Learning
We propose a novel approach to online imitation learning that leverages reward-free world models to solve complex tasks.
Abstract
Reviews and Discussion
This paper broadly considers problems in the regime of interactive imitation learning with learned world models. It differs from previous model-based imitation learning techniques (e.g., model-based inverse reinforcement learning) in that the model does not explicitly learn a reward function, and instead does optimization within the space of Q-functions. In essence, this work seems to lift the insights from IQ-Learn [1] to the space of model-based learning.
[1] Garg, Divyansh, et al. "Iq-learn: Inverse soft-q learning for imitation." Advances in Neural Information Processing Systems 34 (2021): 4028-4039.
Strengths
- This paper is well structured, with a clearly written formulation and decent benchmarks/baselines across which empirical results are shown. The experiments are also thoughtful, covering a variety of questions such as design ablations, impact of number of expert demonstrations, and correlation with reward. The environments for such evaluations are also chosen so as to include both locomotion and manipulation tasks.
- The explanation of the proposed soft-Q learning objective within a learned model is clear and well-grounded in prior work.
- The proposed method shows compelling sample efficiency gains over selected baselines, and addresses the instability of prior interactive IL approaches. The authors also document clearly their use of various techniques to improve training stability.
- The method demonstrates good performance in both state-based and image-based environments.
- The proposed method performs reasoning entirely in latent space, and is not limited by a reconstruction objective when learning the world model, which allows it to handle high-dimensional input such as images without suffering huge computational overhead from the reconstruction objective.
Weaknesses
High Level Feedback:
- The paper introduces a novel framework of reward-free world models, but the main contributions seem to build heavily upon existing work such as IQ-Learn [1] and TD-MPC [2, 3]. I recommend making more explicit highlights of any new theoretical insights or mathematical contributions beyond the removal of reward modeling, or highlighting the transferable empirical insights from the methodology section.
- The authors' claim that reward-based world models lead to training instability could be supported more strongly by the empirical results. Relatedly, it has been shown in prior works that IQ-Learn [1] is rather unstable, so it would be nice to see how this reward-free world model compares to other works focused on sample efficiency in online IL (both model-free and model-based), such as [4].
- The method mentions the use of MPPI for planning in latent space, which could be computationally demanding due to the large number of sampled trajectories. It would be nice to see a more detailed analysis of the computational overhead of the proposed method compared to baselines, especially during real-time execution. It could be helpful for the authors to report empirical metrics for this overhead, perhaps in comparison to baselines.
- While the empirical results demonstrate efficiency comparisons with other online IL methods, they do not show the performance of offline IL methods. In the realizable setting (with 100 expert demonstrations), how does an offline IL method such as Diffusion Policy [5] perform?
Low Level Technical Feedback:
- In the first paragraph of Section 4, the authors claim in lines 202-204 that they propose learning "a reward-free world model... without requiring explicit reward data". But the next sentence, in lines 204-206, proceeds to state, "our model can decode dense rewards for the environment, allowing it to learn a reward function through...". This seems self-contradictory, so I recommend the authors clean up the language so that it is consistent.
[1] Garg, Divyansh, et al. "Iq-learn: Inverse soft-q learning for imitation." Advances in Neural Information Processing Systems 34 (2021): 4028-4039.
[2] Hansen, Nicklas, Xiaolong Wang, and Hao Su. "Temporal difference learning for model predictive control." arXiv preprint arXiv:2203.04955 (2022).
[3] Hansen, Nicklas, Hao Su, and Xiaolong Wang. "Td-mpc2: Scalable, robust world models for continuous control." arXiv preprint arXiv:2310.16828 (2023).
[4] Ren, Juntao, et al. "Hybrid inverse reinforcement learning." ICML (2024).
[5] Chi, Cheng, et al. "Diffusion policy: Visuomotor policy learning via action diffusion." The International Journal of Robotics Research (2023): 02783649241273668.
Questions
- A number of prior works seem to observe the instability of IQ-Learn in the regime of low expert demonstrations. The proposed approach seems to handle this instability better, despite following a similar framework. I am curious to hear why the authors posit this to be the case?
- I am curious how adding noise to the environment dynamics will affect the performance of all methods and baselines? It seems that prior work has seen these Q-function based learning methods (IQ-Learn in particular) to suffer from stochastic transition dynamics [1].
- I appreciate the detailing of the hyperparameters used in the appendix. How sensitive is the performance of IQ-MPC to changes in these hyperparameters?
- As mentioned in the weaknesses section, it would be nice to see a comparison to other online IL baselines such as those in [1], as well as offline BC baselines such as those in [2]. In particular, do offline BC/other online IL methods already come close to expert performance in the case of ~100 expert demonstrations? How about in the case of image-based inputs?
[1] Ren, Juntao, et al. "Hybrid inverse reinforcement learning." ICML (2024).
[2] Chi, Cheng, et al. "Diffusion policy: Visuomotor policy learning via action diffusion." The International Journal of Robotics Research (2023): 02783649241273668.
We thank the reviewer for the constructive review.
Weaknesses
W1: The paper introduces a novel framework of reward-free world models, but the main contributions seem to build heavily upon existing work such as IQ-Learn [1] and TD-MPC [2, 3]. I recommend making more explicit highlights of any new theoretical insights or mathematical contributions beyond the removal of reward modeling, or highlighting the transferable empirical insights from the methodology section.
A1: We highlight the theoretical and empirical insights as follows:
- Additional Theoretical Guarantee: In Section 6 in the anonymous repository (https://anonymous.4open.science/r/rebuttal-reward-free), we have included an additional theoretical analysis of our approach. We use a sub-optimality bound to show that our training minimizes two key terms: the distance between expert and learned state-action distributions and the divergence between true and learned dynamics. The critic and policy objectives address the first term, while the consistency loss minimizes the second. Together, these ensure the value function approximates the expert value as the model learns (a schematic form of the bound is given after this list).
- Better Performance on Complex Tasks and Sampling Efficiency: Empirically, our method demonstrates stable, expert-level performance across a range of complex tasks. By learning a latent transition model, our approach achieves greater stability and better sampling efficiency (especially in high-dimensional settings such as the Dog environment) than model-free methods like IQ-Learn.
- Broad Application of the Reward-free Setup for World Model Training: Our approach offers a potential solution for task-oriented world model learning from human demonstrations in cases where rewards are difficult to obtain. This can support world model training in real-world robotics applications.
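Schematically, the sub-optimality bound referenced above takes the following form (the exact constants and divergence measures are omitted here and depend on the precise statement in the linked analysis):

$$\big|\,V^{\pi_E} - V^{\pi}\,\big| \;\le\; c_1\, D\big(\rho^{\pi_E},\, \rho^{\pi}\big) \;+\; c_2\, \mathbb{E}_{(s,a)\sim\rho^{\pi}}\Big[\, D\big(P(\cdot\mid s,a),\, \hat{P}(\cdot\mid s,a)\big)\Big],$$

where $\rho^{\pi_E}$ and $\rho^{\pi}$ are the expert and learned state-action distributions and $P$, $\hat{P}$ are the true and learned dynamics. The critic and policy objectives target the first term, and the consistency loss targets the second.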
W2: The authors' claim that reward-based world models lead to training instability could be supported more strongly by the empirical results. Relatedly, it has been shown in prior works that IQ-Learn [1] is rather unstable, so it would be nice to see how this reward-free world model compares to other works focused on sample efficiency in online IL (both model-free and model-based), such as [4].
A2: We would like to clarify that we did not claim that reward-based world models lead to training instability. Rather, we assert that our approach demonstrates more stable performance compared to previous methodologies. Regarding the additional empirical assessment, we provide the comparison with the model-free version of hybrid IRL (HyPE). We present the experimental results for the extra baseline in the additional materials available in Section 1 in the anonymous repository: https://anonymous.4open.science/r/rebuttal-reward-free.
Regarding the model-based version (HyPER), we find it to be highly costly in terms of model rollouts and prone to instability during training. We were unable to achieve stable, expert-level results with HyPER in our benchmarks.
W3: While the empirical results demonstrate efficiency comparisons with other online IL methods, they do not show the performance of offline IL methods. In the realizable setting (with 100 expert demonstrations), how does an offline IL method such as Diffusion Policy [5] perform?
A3: We experimented on the Walker Run task with Diffusion Policy, using the same 100 expert demonstrations. We could not obtain satisfactory performance with this amount of data: the training MSE loss converges, but the policy does not reach optimal performance. We present our additional experimental results for Diffusion Policy in Section 5 in the anonymous repository: https://anonymous.4open.science/r/rebuttal-reward-free.
W4: In the first paragraph of Section 4, the authors claim in lines 202-204 that they propose learning "a reward-free world model... without requiring explicit reward data". But the next sentence, in lines 204-206, proceeds to state, "our model can decode dense rewards for the environment, allowing it to learn a reward function through...". This seems self-contradictory, so I recommend the authors clean up the language so that it is consistent.
A4: We apologize for the confusion. Our model, trained on demonstrations without reward data, is capable of estimating rewards for previously unseen state-action pairs. We will revise the writing of this part.
Questions
Q1: A number of prior works seem to observe the instability of IQ-Learn in the regime of low expert demonstrations. The proposed approach seems to handle this instability better, despite following a similar framework. I am curious to hear why the authors posit this to be the case?
A1: The key difference between our method and IQ-Learn is that we learn the Q-function in a latent space while simultaneously training a latent dynamics model. We believe that this latent representation has advantageous properties, which makes it easier to distinguish between expert and behavioral demonstrations. We would also like to emphasize that model-based approaches offer the advantage of improved generalization and sample efficiency. Additionally, learning a latent dynamics model effectively supports learning and bounds sub-optimality, as detailed in the additional theoretical analysis provided in Section 6 in the anonymous repository: https://anonymous.4open.science/r/rebuttal-reward-free.
Q2: I am curious how adding noise to the environment dynamics will affect the performance of all methods and baselines? It seems that prior work has seen these Q-function based learning methods (IQ-Learn in particular) to suffer from stochastic transition dynamics [1].
A2: Since the TD-MPC architecture is fully deterministic, it is not well-suited for tasks involving stochastic dynamics. Consequently, our model is designed specifically for environments that are primarily deterministic, such as those encountered in locomotion and manipulation robotics tasks. If modeling stochastic dynamics is necessary, architectures that incorporate stochastic states, like RSSM, would be more appropriate. This represents a potential direction for future research. However, our additional experiments indicate that in low-dimensional settings, our model demonstrates certain robustness even with noisy environmental dynamics. We present the experimental results on this topic in the additional materials available in Section 4 in the anonymous repository: https://anonymous.4open.science/r/rebuttal-reward-free.
Q3: I appreciate the detailing of the hyperparameters used in the appendix. How sensitive is the performance of IQ-MPC to changes in these hyperparameters?
A3: We conducted an additional ablation study on the hyperparameter $\alpha$, which controls the regularization term. Empirically, we found that a large $\alpha$ leads to higher Q-estimation values, which can cause training instability. In contrast, a small $\alpha$ yields more stable but sub-optimal behaviors, as the penalty on reward magnitude becomes too strong. We present the experimental results for the ablation in the additional materials available in Section 3 in the anonymous repository: https://anonymous.4open.science/r/rebuttal-reward-free.
Q4: As mentioned in the weaknesses section, it would be nice to see a comparison to other online IL baselines such as those in [1], as well as offline BC baselines such as those in [2]. In particular, do offline BC/other online IL methods already come close to expert performance in the case of ~100 expert demonstrations? How about in the case of image-based inputs?
A4: We have provided our responses in the replies for W2 and W3.
I appreciate the authors' response to my concerns and questions, as well as the additional experiments conducted. As other reviewers have mentioned, the clarity of writing in this paper seems to be a sizable concern. I also agree with Reviewer 4dPD and Reviewer D7Rz's shared concern as to the contribution made in this paper. In particular, Reviewer D7Rz agrees with my Q2 about the transferability of reward functions in comparison to Q-functions.
To all these concerns, I would strongly urge the authors to include the related works mentioned by Reviewer D7Rz and the additional experiments conducted to the revised version of this manuscript.
We thank the reviewer for the response. We are currently finalizing the revisions to our manuscript and will submit the updated version shortly.
We appreciate the reviewer’s constructive feedback and have revised the manuscript accordingly. Below, we address the reviewer’s concerns:
- Deeper technical insights.
A1: We have added the theoretical analysis of our approach in Section 4.2 and Appendix H.3 of the revised manuscript.
- Additional baseline for comparison.
A2: We have added the additional baseline, HyPE, on 6 locomotion tasks and 3 dexterous hand manipulation tasks. The new results are presented in Figure 2, Figure 3, and Table 1 of the revised manuscript.
- Computational overhead concern.
A3: We have added an analysis of the additional computational time consumption in Appendix F of the revised manuscript.
- Impact of noisy environment dynamics.
A4: We have added an experiment on noisy environment dynamics, presented in Figure 15 of Appendix E.4 in the revised manuscript.
- More detailed hyperparameter analysis.
A5: We have included an additional ablation study of the hyperparameter $\alpha$ in Figure 14 of Appendix E.3.
We will continue to revise the manuscript and welcome any further feedback or suggestions the reviewer may have.
Thank you for detailing the updates to the manuscript; however, it seems like the version on OpenReview is still the original version, so I am not able to see the actual changes. Is this just due to the authors not being able to update the manuscript during the review process?
We appreciate the reviewer’s feedback. Regarding the issue of not being able to view the new manuscript, we can confirm that it is accessible on our end. The reviewer may try accessing it through the following link:
https://openreview.net/notes/edits/attachment?id=oIuP9EIuvE&name=pdf
Please feel free to reach out if there are any other issues or concerns.
Thank you! I can see the edits, and thank the authors for answering my questions and making the changes. I have increased my score.
Thank you for your review and feedback. Please just let us know if we can further clarify anything.
The manuscript presents a novel approach to online imitation learning (IL) that leverages reward-free world models, specifically focusing on latent-space representations rather than explicit rewards. The proposed model, IQ-MPC, utilizes inverse soft-Q learning in the policy space, sidestepping traditional optimization in the reward-policy space, which has previously led to instability in online IL applications. IQ-MPC aims to demonstrate robust, expert-level performance across high-dimensional and complex control tasks using a combination of latent dynamics and model predictive control (MPC) in the absence of explicit reward functions.
Strengths
Novelty in Reward-Free Approach: The idea of removing reward dependence by operating in latent space addresses a significant challenge in imitation learning, making the model highly adaptable to complex environments with intricate dynamics.
Comprehensive Evaluation: Experiments conducted on a variety of benchmarks, such as DMControl, MyoSuite, and ManiSkill2, demonstrate the model's capabilities across tasks, including both state-based and visual IL tasks.
Weaknesses
- It is recommended that the author further explain the experimental results in 5.2, such as why the model recovers rewards closer to the ground-truth rewards when the real rewards are medium, and why the model prediction variance increases sharply when the real rewards are high; we encourage the author to conduct further experiments to illustrate the extent to which the deviation of reward prediction affects the results of MPPI;
- The author mentioned that the main advantages of the world model are sampling complexity and its future planning ability. The manuscript focuses on its excellent planning ability but lacks a comparison with the baselines in sampling complexity. It is recommended to expand the experiment in 5.3 and compare IQ-MPC with the baselines in the case of a small amount of expert data;
- Section 3 gives a general mapping of the inverse Bellman equation to recover rewards. Experiments can be added to reflect the performance of other baseline methods in reward recovery.
Questions
Please refer to the weaknesses.
I have made every effort to review this paper; however, I would appreciate it if the area chair could consider that its content is not closely aligned with my expertise when evaluating all the reviews.
We thank the reviewer for the constructive review.
Q1: It is recommended that the author further explain the experimental results in 5.2, such as why the model recovers rewards closer to the ground-truth rewards when the real rewards are medium, and why the model prediction variance increases sharply when the real rewards are high; we encourage the author to conduct further experiments to illustrate the extent to which the deviation of reward prediction affects the results of MPPI;
A1: One possible explanation for the high variance in the estimated expert rewards is as follows:
There are multiple equivalent reward formulations that lead to optimal trajectories, and the maximum entropy objective selects the one with the highest entropy. Our actor-critic architecture is optimized with the maximum entropy inverse RL objective, leading to an even distribution of rewards for expert demonstrations. As a result, rewards that are closer to the expert tend to exhibit higher variance. A similar phenomenon is also observed in [1].
Q2: The author mentioned that the main advantages of the world model are sampling complexity and its future planning ability. The manuscript focuses on its excellent planning ability but lacks a comparison with the baselines in sampling complexity. It is recommended to expand the experiment in 5.3 and compare IQ-MPC with the baselines in the case of a small amount of expert data;
A2: For the sampling complexity analysis, our method converges to the optimum significantly faster than model-free methods (IQL+SAC) in high-dimensional settings. This trend is demonstrated in Figure 10 of our manuscript. A similar phenomenon is observed in the visual experiment with the Walker Walk task, as shown in Figure 5. We should also clarify that better sampling complexity does not refer to a smaller number of expert demonstrations. Instead, sampling complexity refers to the number of environment interactions required to reach optimal performance.
Q3: Section 3 gives a general mapping of the inverse Bellman equation to recover rewards. Experiments can be added to reflect the performance of other baseline methods in reward recovery.
A3: We thank the reviewer for pointing out the need to include additional baselines for reward recovery. We will incorporate them accordingly.
[1] Freund, G. J., Sarafian, E., & Kraus, S. (2023, July). A coupled flow approach to imitation learning. In International Conference on Machine Learning (pp. 10357-10372). PMLR.
We appreciate the reviewer’s constructive feedback and have revised the manuscript accordingly. Below, we address the reviewer’s concerns:
- Discussion on high variance in estimated rewards.
A1: We have discussed this issue in lines 1208–1215 of the Appendix in the revised manuscript.
- Additional experiments and baselines on reward recovery.
A2: We have included a comparison between our model and the baseline IQL+SAC in terms of reward recovery correlation, by calculating the Pearson correlation between the estimated and ground-truth rewards. The results are shown in Table 7 of Appendix G in the revised manuscript.
We will continue to revise the manuscript and welcome any further feedback or suggestions the reviewer may have.
I appreciate the response and the revised paper from the authors. I intend to maintain my score and increase my confidence score.
Thank you for your review and feedback. Please just let us know if we can further clarify anything.
The paper looks at the problem of model-based inverse reinforcement learning. The authors propose a "reward-free" approach by combining reward-free model-free imitation learning (IQ-learn) with learned dynamics models. The proposed method, IQ-MPC, is shown to be more sample efficient in comparison to IQ-learn across a range of continuous control tasks.
The contributions are:
- A novel model-based, reward-free IRL algorithm
- Empirical validation across a range of both state-based and vision-based environments, including DMControl for locomotion, MyoSuite for dexterous hand manipulation, and ManiSkill2 for object handling
- Ablations showing that the method is performant even as the number of expert trajectories is reduced.
Strengths
This paper proposes a new approach to interactive imitation learning that is model-based and reward-free. This results in sample-complexity wins (from the model) and stability (from the lack of adversarial reward training). The paper also shows that the methods scale across different environments, including visual imitation learning tasks.
Originality: The work presents a unique integration of IQ-Learn’s inverse soft-Q learning with model predictive control, enabling a reward-free formulation that sidesteps adversarial reward estimation.
Quality: Empirical results support the approach’s benefits in diverse settings, achieving stable, expert-level performance on a range of tasks from DMControl and MyoSuite to ManiSkill2.
Clarity: The paper is easy to follow, and the algorithm does a good job of clearly explaining training and inference procedures. The results section is well organized.
Significance: The proposed method is practical and the reward-free model-based learning is likely to scale well.
Weaknesses
Fundamental technical concern: The nice thing about recovering reward functions in IRL is that rewards transfer while Q doesn't. The Q function entangles reward with environment dynamics. I would imagine as the model changes across iterations, the Q function in the worst case may fail to transfer. I would love to be convinced by math that this is not the case (it may not have manifested in the limited set of experiments). Can the authors provide theoretical analysis demonstrating how their IQ-MPC handles Q-function transfer across model iterations, or to discuss potential failure modes?
Comparison to baselines: The paper would benefit from broader comparisons, especially with other model-based and model-free IRL approaches, to support claims of stability and efficiency. For example, how does it compare to hybrid inverse reinforcement learning (both model-free and model-based) https://arxiv.org/pdf/2402.08848v1 which have SOTA results on Mujoco and D4RL environments. Could the authors run against the baselines in https://arxiv.org/pdf/2402.08848v1?
Computational overhead: IQ-MPC has a more expensive inference time process than the baselines, making the comparisons unfair. Perhaps the planner can be distilled into the same policy class to have a fair comparison with model-free baselines. Additionally, including the number of interactions in the model would be helpful. One would expect IQ-MPC to have lower interactions with environment, but perhaps a larger (world + model) interactions compared to IQ-Learn.
Demonstration efficiency: The high number of expert trajectories required (100-500) contrasts with prior work, such as IQ-Learn's ability to learn with as few as 5-10 trajectories in MuJoCo. It would be good to resolve this discrepancy, perhaps by running in the same MuJoCo setting as the IQ-Learn and hybrid inverse reinforcement learning baselines.
Limited novelty: IQ-MPC appears as a blend of IQ-Learn’s inverse soft-Q learning and TD-MPC’s model predictive control framework, without substantial originality beyond this combination. While the reward-free setup is interesting, the paper could be strengthened by examining distinct advantages from this integration. This could be done via an apples-to-apples comparison with reward-based model-based IRL.
Scope of “State-Only” data: The mention of “state-only” data in the approach is misleading, as state-action pairs are needed to learn the Q-function and policy. Clarifying this would prevent confusion.
Incomplete related work: The paper does not cite fundamental papers like Ziebart’s max-entropy IRL framework or recent model-based IRL methods that directly relate to the current work. Including these citations would better contextualize IQ-MPC within the IRL landscape. Adding recent model-based IRL techniques to related work would help readers understand how IQ-MPC compares and advances beyond existing methods.
- Ziebart, B. D., Maas, A. L., Bagnell, J. A., & Dey, A. K. (2008, July). Maximum entropy inverse reinforcement learning. In Aaai (Vol. 8, pp. 1433-1438).
- Ren, J., Swamy, G., Wu, Z. S., Bagnell, J. A., & Choudhury, S. (2024). Hybrid inverse reinforcement learning. arXiv preprint arXiv:2402.08848.
- P. Englert, A. Paraschos, J. Peters, and M. P. Deisenroth. Model-based imitation learning by probabilistic trajectory matching. In 2013 IEEE International Conference on Robotics and Automation, pp.1922–1927. 2013. doi:10.1109/ICRA.2013.6630832.
- A. Hu, G. Corrado, N. Griffiths, Z. Murez, C. Gurau, H. Yeo, A. Kendall, R. Cipolla, and J. Shotton. Model-based imitation learning for urban driving. arXiv preprint arXiv:2210.07729, 2022.
- R. Kidambi, J. Chang, and W. Sun. Mobile: Model-based imitation learning from observation alone, 2021.
- N. Baram, O. Anschel, and S. Mannor. Model-based adversarial imitation learning. Conference on Neural Information Processing Systems, 2016.
- M. Igl, D. Kim, A. Kuefler, P. Mougin, P. Shah, K. Shiarlis, D. Anguelov, M. Palatucci, B. White, and S. Whiteson. Symphony: Learning realistic and diverse agents for autonomous driving simulation. arXiv preprint arXiv:2205.03195, 2022.
- Z.-H. Yin, W. Ye, Q. Chen, and Y. Gao. Planning for sample efficient imitation learning. Conference on Neural Information Processing Systems, 2022.
- R. Rafailov, T. Yu, A. Rajeswaran, and C. Finn. Visual adversarial imitation learning using variational models, 2021.
- B. DeMoss, P. Duckworth, N. Hawes, and I. Posner. Ditto: Offline imitation learning with world models. 2023.
- W. Zhang, H. Xu, H. Niu, P. Cheng, M. Li, H. Zhang, G. Zhou, and X. Zhan. Discriminator-guided model-based offline imitation learning. Conference on Robot Learning, 2023.
Edit: Increased score during rebuttal period
Questions
The "Weakness" section contains all questions I had.
Other clarifications:
- Line 155: there is no $p_0$ in the equation above
- Line 200-204: I failed to understand the argument at all. One is always free to learn dynamics models without reward?
We thank the reviewer for the constructive review.
Q1: Theoretical Guarantee.
A1: Thank you for your constructive comments. To address your theoretical concern, we first borrow a bound from prior work [1]. According to Theorem 2 in [1], given an unknown MDP $\mathcal{M}$ with transition probability $P$ and our learned MDP $\hat{\mathcal{M}}$ with transition probability $\hat{P}$ in the latent space, and letting $R_{\max}$ be the maximum of the unknown reward in $\mathcal{M}$, the deviation between the learned and expert values is bounded by a weighted sum of two terms: (i) the statistical distance between the expert and behavioral state-action distributions, and (ii) the divergence between the true dynamics $P$ and the learned dynamics $\hat{P}$.
Our critic and policy objectives can be interpreted as a min-max optimization of Eq. (4) in our manuscript, as outlined in lines 156-158 and 185-186. This approach can be viewed as minimizing a statistical distance with an entropy term, corresponding to Eq. (2) in our manuscript. Thus, our critic and policy objectives effectively minimize the first term in the bound above, the statistical distance between the expert and behavioral state-action distributions.
Minimizing our consistency loss minimizes the second term in the bound, the divergence between the true and learned dynamics.
In the specific case of Gaussian dynamics, assuming that the transition probabilities $P$ and $\hat{P}$ are approximately Gaussian and that our estimated standard deviation $\hat{\sigma}$ is close to the actual standard deviation $\sigma$, the consistency loss minimizes an upper bound on the statistical distance between $\hat{P}$ and $P$.
Using Pinsker's inequality, we have
$$D_{TV}\big(\hat{P}, P\big) \;\le\; \sqrt{\tfrac{1}{2}\, D_{KL}\big(\hat{P}\,\|\,P\big)}.$$
With the Gaussian assumptions above (shared standard deviation $\sigma$), the KL divergence can be written in terms of the means:
$$D_{KL}\big(\mathcal{N}(\hat{\mu}, \sigma^2 I)\,\|\,\mathcal{N}(\mu, \sigma^2 I)\big) \;=\; \frac{\|\hat{\mu} - \mu\|_2^2}{2\sigma^2}.$$
Given a predicted latent state $\hat{z}'$ from the learned dynamics and the actual latent state $z'$ encoded from the next observation under the unknown dynamics, minimizing the L2 consistency loss $\|\hat{z}' - z'\|_2^2$ approximately minimizes the distance between the means of the Gaussian distributions. This, in turn, approximately minimizes the right-hand side of Pinsker's inequality. Consequently, our consistency loss minimizes the statistical distance between the dynamics.
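Combining the two relations above (under the stated Gaussian assumptions, identifying the means with the predicted and encoded latent states) gives a bound directly controlled by the consistency loss:

$$D_{TV}\big(\hat{P}(\cdot\mid z,a),\, P(\cdot\mid z,a)\big) \;\le\; \sqrt{\tfrac{1}{2}\,D_{KL}\big(\hat{P}\,\|\,P\big)} \;=\; \frac{\|\hat{\mu} - \mu\|_2}{2\sigma} \;\approx\; \frac{\|\hat{z}' - z'\|_2}{2\sigma}.$$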
In conclusion, our training objective ensures that as the dynamics model learns, it simultaneously minimizes an upper bound on the deviation between the learned value function and the expert value. Given that the value function is computed from the Q function and the policy (in the soft-Q setting, $V^\pi(z) = \mathbb{E}_{a\sim\pi}[Q(z,a) - \log\pi(a\mid z)]$), this also guarantees that the Q function can follow along as the dynamics model is being optimized.
In the reviewer's question, it was suggested that rewards learned by IRL methods can be "transferred", which we interpret as meaning transferable across different dynamics. However, IRL methods such as GAIL are dynamics-aware [2], as they match state-action distributions rather than policy distributions. Therefore, the learned rewards should, to some extent, contain information about the dynamics. If the explanation above doesn't fully address your question, could you please provide further clarification?
Q2: Comparison to baselines:
A2: We provide the comparison with the model-free version of hybrid IRL (HyPE). We present the experimental results for the extra baseline in the additional materials available in Section 1 in the anonymous repository: https://anonymous.4open.science/r/rebuttal-reward-free.
Regarding the model-based version (HyPER), we find it to be highly costly in terms of model rollouts and prone to instability during training. We were unable to achieve stable, expert-level results with HyPER in our benchmarks. We have chosen to leave this as part of our future work.
Q3: Computational overhead:
A3: Experiments in Appendix E.1 show that our model converges faster than IQ-Learn in the high-dimensional Dog setting, which demonstrates better online sampling efficiency.
We present the experimental results for the computational time consumption in Section 2 in the additional materials available in the anonymous repository: https://anonymous.4open.science/r/rebuttal-reward-free.
We would also like to emphasize that although our model requires extensive rollouts, it is a latent model rather than a model of the environment. Therefore, the computation can be parallelized using CUDA, making it much more efficient.
Regarding policy distillation, we can directly interact with the environment using the policy prior $\pi$ alone, which is more computationally efficient. While this approach may affect performance stability in certain environments (such as the Dog environment), the performance difference when using only the policy prior is not significant in simpler settings.
Q4: Demonstration efficiency:
A4: We show in Figure 6 of our manuscript that our model can still achieve expert-level performance with only a small number of expert trajectories. In detail, our model reaches expert-level performance with 10 demonstrations on Hopper Hop (locomotion) and with 5 demonstrations on Object Hold (dexterous hand manipulation).
Q5: Scope of "State-Only" data:
A5: We apologize for the misleading statement. By 'state-only' data, we are referring to interactions with the environment that involve only the state information, without explicit reward signals. However, expert demonstrations do indeed include action data. We will change the wording in the revised manuscript.
Q6: Incomplete related work:
A6: Thank you for providing the information on related works. We will include them in the related works section of our manuscript.
[1] Kolev, V., Rafailov, R., Hatch, K., Wu, J., & Finn, C. (2024). Efficient Imitation Learning with Conservative World Models. arXiv preprint arXiv:2405.13193.
[2] Garg, D., Chakraborty, S., Cundy, C., Song, J., & Ermon, S. (2021). Iq-learn: Inverse soft-q learning for imitation. Advances in Neural Information Processing Systems, 34, 4028-4039.
We appreciate the reviewer’s constructive feedback and have revised the manuscript accordingly. Below, we address the reviewer’s concerns:
- Fundamental technical concern
A1: In response to the reviewer’s technical concern, we have added the theoretical analysis of our approach in Section 4.2 and Appendix H.3 of the revised manuscript.
- Comparison to baselines
A2: We have added the additional baseline, HyPE, on 6 locomotion tasks and 3 dexterous hand manipulation tasks. The new results are presented in Figure 2, Figure 3, and Table 1 of the revised manuscript.
- Computational overhead
A3: We have added an analysis of the additional computational time consumption in Appendix F of the revised manuscript.
- Scope of “State-Only” data
A4: We have clarified the wording on this issue in lines 195–196 of the revised manuscript.
- Incomplete related work
A5: We have included additional related works in lines 101–113 in the revised manuscript, highlighted in blue.
We will continue to revise the manuscript and welcome any further feedback or suggestions the reviewer may have.
Thank you for engaging with the review, providing detailed responses, running extra experiments, and adding missing references.
I am happy to increase the score to 6. But I would like to note two remaining concerns:
Fundamental technical concern The authors discuss the simulation lemma to justify the model learning approach; however, my question was not quite related to that. Put simply: in model-based IRL, we have two players, a reward player and a policy player. At every iteration, the policy player rolls out the policy in the real world and uses that data to update the world model. The reward player updates the reward to discriminate between the policy and the expert. The policy player then plays best response by optimizing the new reward in the world model (by planning).
Crucially, the reward here is disentangled from the world model. So as the world model changes within the IRL iterations, the reward can be optimized in the new model. However, if you are directly learning a Q function, it entangles the reward and the model. So as the model changes across IRL iterations, the Q function ends up "averaging" over models. IQ-Learn, being model-free with stationary dynamics, does not have this issue. But I worry this might be an issue for IQ-MPC.
Computational overhead I appreciate the timing plots. I am still concerned that it's unfair to compare planning at inference time to a policy at inference time, and making that tradeoff clear in the results section is important.
We sincerely thank the reviewer for their thoughtful feedback and clarification on the technical concerns, as well as for the increased score of our submission.
Technical Concerns
The reviewer raised two key technical concerns:
C1. The reviewer suggested that, in model-based IRL, the learned rewards are disentangled from the learned model and can transfer as the model evolves.
C2. The reviewer further pointed out that learning a Q-function instead of rewards might inherently include model information, potentially causing the Q-function to represent an "average" over the changing models during training iterations.
We address these concerns with our perspectives as follows:
Response to C1
- On the dynamics-awareness of learned rewards: We respectfully suggest that when rewards in model-based IRL are learned through the adversarial imitation learning objective (e.g., GAIL), they are inherently dynamics-aware. This is because the objective minimizes the statistical distance between the behavioral and expert state-action distributions, $\rho_\pi$ and $\rho_E$, with an entropy regularization term. Since the state-action distribution captures dynamics information, the learned rewards also reflect these dynamics. Furthermore, [1] explicitly states in Table 1 that GAIL is dynamics-aware.
- On the equivalence between learning Q-functions and rewards: We also humbly propose that, given the bijective nature of the inverse Bellman operator, there is a direct correspondence between each Q-value and a reward value for a given policy (see the sketch below). From the perspective of state-action distribution matching, learning a Q-function is therefore equivalent to learning a reward function, although the training procedures may differ.
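For a fixed policy $\pi$ and fixed dynamics $P$, the correspondence we refer to is the inverse soft Bellman operator of IQ-Learn (written here in its standard form; the manuscript's notation may differ slightly):

$$r(s,a) \;=\; (\mathcal{T}^{\pi} Q)(s,a) \;=\; Q(s,a) \;-\; \gamma\, \mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[ V^{\pi}(s') \big], \qquad V^{\pi}(s) \;=\; \mathbb{E}_{a\sim\pi}\big[ Q(s,a) - \log \pi(a\mid s) \big].$$

Since $\mathcal{T}^{\pi}$ is invertible for fixed $\pi$ and $P$, every Q-function determines a unique implicit reward and vice versa, which is the sense in which learning $Q$ is equivalent to learning $r$.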
Response to C2
- On the convergence of the Q-function: We respectfully hold a different view from the suggestion that the Q-function "averages" over models during training. Instead, we argue that the Q-function converges to the saddle point of the min-max optimization problem defined with the final learned model.
- Details on Q-function learning: In our model, the Q function is updated using rollouts generated by the current latent model. These rollouts occur in the latent space and are not stored in the replay buffer. Instead, only trajectories obtained from actual interactions with the environment are saved to the behavioral buffer. As a result, the data in the buffers always reflect the true dynamics of the environment. Over time, as the latent model converges, the Q function also converges to the optimal Q-value corresponding to the current (converged) latent model, rather than representing an "average" of previous latent models (a schematic sketch of this update scheme is given after this list).
- Comparison with other model-based approaches: If the Q function were truly averaging over models during training, a similar issue would arise in TD-MPC2 [2] and other model-based methods that jointly learn the model and the Q function. In TD-MPC2, the Q function is also trained using unrolled latent states produced by the evolving learned model, rather than by a fixed model. This suggests that the same concern could theoretically apply to TD-MPC2. However, empirical results indicate that TD-MPC2 does not exhibit this issue.
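A minimal schematic of the update scheme described above (an illustrative sketch only, not our actual implementation; the encoder, latent dynamics, environment, and update routine are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(obs):                         # placeholder encoder: observation -> latent
    return obs

def latent_dynamics(z, a):               # placeholder learned latent transition model
    return z + 0.1 * a

def env_step(obs, a):                    # placeholder true environment dynamics
    return obs + a + 0.01 * rng.standard_normal(obs.shape)

behavioral_buffer = []                   # stores ONLY real environment transitions
obs = np.zeros(3)

for step in range(100):
    a = rng.standard_normal(3)

    # 1) Real interaction: stored in the buffer, so the buffered data always
    #    reflects the true environment dynamics.
    next_obs = env_step(obs, a)
    behavioral_buffer.append((obs, a, next_obs))
    obs = next_obs

    # 2) Latent rollout under the *current* model: used transiently for the
    #    critic/policy update and then discarded, never written to the buffer.
    z = encode(behavioral_buffer[-1][0])
    latent_rollout = []
    for _ in range(5):
        a_im = rng.standard_normal(3)
        z_next = latent_dynamics(z, a_im)
        latent_rollout.append((z, a_im, z_next))
        z = z_next
    # update_critic_policy_and_model(latent_rollout, behavioral_buffer)  # placeholder
```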
Computational Overhead
We also appreciate the reviewer’s suggestion to clarify the computational overhead in the results section. In response, we have revised the manuscript to include a clarification in lines 373–375, highlighted in purple.
Once again, we are very grateful for the reviewer’s thoughtful feedback and would be happy to provide further clarification if needed. Please do not hesitate to reach out with any additional questions or comments.
[1] Garg, D., Chakraborty, S., Cundy, C., Song, J., & Ermon, S. (2021). Iq-learn: Inverse soft-q learning for imitation. Advances in Neural Information Processing Systems, 34, 4028-4039.
[2] Hansen, Nicklas, Hao Su, and Xiaolong Wang. "Td-mpc2: Scalable, robust world models for continuous control." arXiv preprint arXiv:2310.16828 (2023).
We draw upon the concept of "reward ambiguity" in inverse reinforcement learning (IRL) as discussed in [3] to reinforce our first point in response to C1 from our previous reply. As noted in [3], a reward function retains its optimality under the following transformation:
$$r'(s, a, s') \;=\; r(s, a, s') \;+\; \gamma\,\Phi(s') \;-\; \Phi(s),$$
where $\Phi$ represents an arbitrary potential function. Given a transition function $T(s' \mid s, a)$, the transformed state-action reward can be expressed as
$$r'(s, a) \;=\; r(s, a) \;+\; \gamma\, \mathbb{E}_{s' \sim T(\cdot\mid s, a)}\big[\Phi(s')\big] \;-\; \Phi(s).$$
If an $(s, a, s')$ tuple arises from dynamics that deviate from $T$, it can affect policy optimality. Consequently, the state-action rewards generated by AIL and many previous IRL methods are inherently dependent on the dynamics.
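As a small illustrative example (with hypothetical numbers), suppose action $a$ in state $s$ leads to one of two next states $s'_1, s'_2$ with potentials $\Phi(s'_1) = 1$ and $\Phi(s'_2) = 0$. Under dynamics $T$ with $T(s'_1 \mid s, a) = 0.9$ and under alternative dynamics $\tilde{T}$ with $\tilde{T}(s'_1 \mid s, a) = 0.1$, the shaped state-action rewards differ:

$$r'_T(s,a) - r'_{\tilde{T}}(s,a) \;=\; \gamma\,(0.9 - 0.1)\,\big(\Phi(s'_1) - \Phi(s'_2)\big) \;=\; 0.8\,\gamma,$$

even though the underlying $r(s,a,s')$ and the potential $\Phi$ are unchanged. This is the sense in which state-action rewards carry dynamics information.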
We sincerely appreciate the reviewer’s thoughtful feedback and are more than happy to provide further clarification if needed. Please feel free to reach out with any additional questions or comments.
[3] Luo, F. M., Cao, X., Qin, R. J., & Yu, Y. (2022). Transferable Reward Learning by Dynamics-Agnostic Discriminator Ensemble. arXiv preprint arXiv:2206.00238.
This paper introduces an extension of the TD-MPC framework for imitation, specifically within the inverse reinforcement learning context. Rather than learning an explicit reward function, the authors propose learning implicit rewards in the Q-function space. They use the implicitly learned reward function, along with learned world models, to perform MPC in the latent space. The results demonstrate state-of-the-art performance when compared to other baselines.
Strengths
- the idea is interesting.
- I believe that (when done right) this is a scalable approach.
Weaknesses
- the paper is written poorly. Many conclusions drawn lack causality and need further clarification (see below).
- mathematically there seem to be many issues (see questions)
- I believe many experiments use badly tuned baselines (see questions)
Problematic Causality in writing:
- “By utilizing an inverse soft-Q learning objective for the critic network (Garg et al., 2021), our method derives rewards directly from Q-values and the policy, effectively rendering the world model reward-free. This novel formulation mitigates key challenges in IL, including out-of-distribution errors and bias accumulation.” → Inverse Q-learning is not the novel formulation that mitigates these problems; these are general advantages of inverse RL.
- “ In the objective function, p0 is the initial distribution of states. In this way, we can perform imitation learning by leveraging actor-critic architecture.” → unclear why the initial state distribution should allow to leverage actor-critic architectures. Also, p0 is not mentioned anywhere in the equations.
- “, and α is a scalar coefficient to ensure stable training and prevent overfitting.” That's a bit unspecific. More explanation is needed, or add a citation.
- Section 4, first paragraph. The storyline is not very clear. “In conventional latent world models, an explicit reward function is typically used to map latent representations and actions to observed reward data. However, in imitation learning, the model is limited to learning from a finite set of expert state-action demonstrations, making it difficult for traditional world models to directly perform imitation learning. To address this limitation, we propose a reward-free world model that learns solely from expert demonstrations and environment interactions, without requiring explicit reward data. ” The authors first note that rewards are often learned alongside world models during training. They then argue that in imitation learning, conventional world models are typically used without rewards and only on expert data (not true), making imitation challenging. To address this, the authors propose their solution: training a world model without rewards.
Questions
- Equation (4): There seems to be an error in the expectation over the policy state-action distribution. What is V(s) should be Q(s, a). Or some part of the derivation is missing.
- “Compared to Eq.6, the key difference is the second term of the objective, which computes the original value difference $\mathbb{E}_{(s_t, a_t, s'_t)\sim\mathcal{B}_\pi}\big[V^\pi(z_t) - \gamma V^\pi(z'_t)\big]$ using only the representation of the initial state $s_0$.” This sentence does not explain the transition from Eq. 6 to 12. And still, what is $V(s)$ should be $Q(s, a)$.
- In Eq. 12, I think that the sum should be inside the expectation.
- In Eq. 15, the term $V^\pi(z_t) - \gamma V^\pi(z'_t)$ is used for the reward. This cannot be a valid reward signal, since it is just equivalent to a reward shaping term, which is policy invariant.
- On what task was the ablation study done, high-dim or low dim?
- I believe there is an important work missing [1]. This work provides a bridge between SQIL and IQ-Learn, showing that the underlying problems are equivalent, and provides a solution to the instability also encountered in this paper.
Minor Things:
- Line 212: typo → Q twice
[1] Al-Hafez et. al. LS-IQ: Implicit Reward Regularization for Inverse Reinforcement Learning
We thank the reviewer for the constructive review.
The reviewer primarily raised concerns in three areas: writing, mathematical formulations, and other aspects such as baselines and related work. However, we respectfully disagree with the reviewer's claims regarding the alleged inaccuracies in the mathematical formulations, including $V(s)$ being mistaken for $Q(s,a)$, errors in the reward term, and the arrangement of expectations and sums. We address these concerns in detail in responses A2, A3, A4, and A5.
Writing Concerns
Q1: Problematic writing.
Q1.1: “By utilizing an inverse soft-Q learning objective for the critic network (Garg et al., 2021), our method derives rewards directly from Q-values and the policy, effectively rendering the world model reward-free. This novel formulation mitigates key challenges in IL, including out-of-distribution errors and bias accumulation.” → Inverse Q-learning is not the novel formulation that mitigates these problems; these are general advantages of inverse RL.
A1.1: We acknowledge that this statement is not entirely accurate and will revise the manuscript accordingly.
Q1.2: “ In the objective function, p0 is the initial distribution of states. In this way, we can perform imitation learning by leveraging actor-critic architecture.” → unclear why the initial state distribution should allow to leverage actor-critic architectures. Also, p0 is not mentioned anywhere in the equations.
A1.2: We thank the reviewer for bringing this issue to our attention. The statement "In the objective function, $p_0$ is the initial distribution of states" should be removed. We will revise the manuscript accordingly.
Q1.3: “, and α is a scalar coefficient to ensure stable training and prevent overfitting.” That's a bit unspecific. More explanation is needed, or add a citation.
A1.3: We thank the reviewer for highlighting the potentially unclear statement. Regarding the hyperparameter $\alpha$, we have conducted additional ablation studies to evaluate its impact. The results have been included in Section 3 in the anonymous repository: https://anonymous.4open.science/r/rebuttal-reward-free.
Q1.4: Section 4, first paragraph. The storyline is not very clear. “In conventional latent world models, an explicit reward function is typically used to map latent representations and actions to observed reward data. However, in imitation learning, the model is limited to learning from a finite set of expert state-action demonstrations, making it difficult for traditional world models to directly perform imitation learning. To address this limitation, we propose a reward-free world model that learns solely from expert demonstrations and environment interactions, without requiring explicit reward data. ” The authors first note that rewards are often learned alongside world models during training. They then argue that in imitation learning, conventional world models are typically used without rewards and only on expert data (not true), making imitation challenging. To address this, the authors propose their solution: training a world model without rewards.
A1.4: We apologize for the unclear statement. Previous world models typically utilize an explicit reward model to fit reward data. In reinforcement learning, this often involves regressing the reward signal retrieved from the environment, whereas in online imitation learning, it is commonly achieved through adversarial training to distinguish between expert and behavioral demonstrations. In contrast, our model eliminates the need for an explicit reward model. Instead, we decode the reward directly from an adversarially trained critic. This represents a key distinction between our approach and prior work. We will revise this section to clarify accordingly.
Mathematical Concerns
Q2: Equation (4): There seem to be an error in the expectation over the policy state-action distribution. What is V(s) should be Q(s, a). Or some part of the derivation is missing.
A2: We apologize for any confusion, but we believe there may be a misunderstanding regarding this objective. The value function $V^\pi(s)$ mentioned in the question should not be replaced by $Q(s,a)$ in the objective, as the additional entropy term in Eq. 1 plays a crucial role in the formulation.
Regarding the derivation of Eq. 4, the objective is derived from Eq. 1 using the inverse Bellman operator defined in Eq. 3 and the value function $V^\pi$, which is introduced in the paragraph following Eq. 3.
The objective in Eq. 1 is the maximum-entropy IRL objective, which can be equivalently rewritten by grouping the expected reward under the behavioral distribution together with the policy entropy term (we restate the resulting transformation in standard IQ-Learn notation below).
The first term in Eq. 1, the expected reward under the expert distribution $\mathbb{E}_{(s,a)\sim\rho_E}[r(s,a)]$, can be transformed into the first term in Eq. 4, $\mathbb{E}_{(s,a)\sim\rho_E}\big[Q(s,a) - \gamma\,\mathbb{E}_{s'}[V^\pi(s')]\big]$, via the inverse Bellman operator.
For the second term, $\mathbb{E}_{(s,a)\sim\rho_\pi}[r(s,a)]$, since we have the entropy term $H(\pi)$ in addition to this expectation, employing the inverse Bellman operator and the definition of the soft value function $V^\pi(s) = \mathbb{E}_{a\sim\pi}[Q(s,a) - \log\pi(a\mid s)]$ yields the value-difference form $\mathbb{E}_{(s,a,s')\sim\rho_\pi}\big[V^\pi(s) - \gamma V^\pi(s')\big]$.
Finally, we convert the regularizer $\psi(r)$ into $\psi(\mathcal{T}^\pi Q)$ using the inverse Bellman operator.
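Putting these steps together in the standard IQ-Learn notation (which we assume matches the manuscript's Eqs. 1-4 up to minor notational differences), the transformation reads:

$$\max_{r}\,\min_{\pi}\; \mathbb{E}_{\rho_E}\big[r(s,a)\big] - \Big(\mathbb{E}_{\rho_\pi}\big[r(s,a)\big] + H(\pi)\Big) - \psi(r) \;\;\longrightarrow\;\; \max_{Q}\,\min_{\pi}\; \mathbb{E}_{\rho_E}\Big[Q(s,a) - \gamma\,\mathbb{E}_{s'}\big[V^\pi(s')\big]\Big] - \mathbb{E}_{\rho_\pi}\Big[V^\pi(s) - \gamma V^\pi(s')\Big] - \psi\big(\mathcal{T}^\pi Q\big),$$

where $\mathcal{T}^\pi$ is the inverse soft Bellman operator and $V^\pi(s) = \mathbb{E}_{a\sim\pi}[Q(s,a) - \log\pi(a\mid s)]$. This is why $V^\pi$, rather than $Q$, appears in the behavioral-distribution term: the entropy has been absorbed into the soft value.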
Q3: “Compared to Eq.6, the key difference is the second term of the objective, which computes the original value difference $\mathbb{E}_{(s_t, a_t, s'_t)\sim\mathcal{B}_\pi}\big[V^\pi(z_t) - \gamma V^\pi(z'_t)\big]$ using only the representation of the initial state $s_0$.” This sentence does not explain the transition from Eq. 6 to 12. And still, what is $V(s)$ should be $Q(s, a)$.
A3: We apologize for the confusion. The transformation can be described in detail as follows:
- In Eq. 12, we use expectations over batches sampled from the expert buffer $\mathcal{B}_E$ and the behavioral buffer $\mathcal{B}_\pi$ to approximate the expectations over the expert and behavioral state-action distributions in Eq. 6.
- In Eq. 12, we compute this objective using latent representations instead of the actual states in Eq. 6, as stated in lines 236-237.
- Regarding the derivation of the initial-state value term in Eq. 12 from its original form in Eq. 6, we refer to Appendix F.1 of our manuscript. This is also mentioned in lines 247-248 of the main text.
- As mentioned in lines 193-194, we employ reward regularization $\psi$ in Eq. 6. Therefore, we use a regularized objective with an additional penalty on the implicit reward magnitude (controlled by $\alpha$), which corresponds to the third term in Eq. 12. Empirically, we find it better to compute this term using both expert and behavioral batches, so we sample from both buffers rather than only from the expert buffer in Eq. 12.
- As noted in lines 253-254, we compute the target value function using the target Q network in Eq. 12, which is maintained via soft updates.
- In Eq. 12, we sum the inverse soft-Q loss over a horizon with a per-timestep weighting factor, whereas in Eq. 6 it only represents the loss for a single timestep, as mentioned in lines 236-237 and 249-251.
- The $V^\pi$ in Eq. 12 should not be $Q$; the reasons are given in answer A2.
Q4: In Eq. 12, I think that the sum should be inside the expectation.
A4: These two formulations are equivalent up to a difference in notation. We can always interchange the sum and the expectation by linearity of expectation (or, more generally, under conditions such as Fubini's theorem). In our notation, at each timestep $t$, we sample a batch from the buffer and compute the inverse soft-Q loss for that timestep. We then sum the losses over the prediction horizon. This notation is aligned with our actual implementation.
Q5: In Eq. 15, the term $V^\pi(z_t) - \gamma V^\pi(z'_t)$ is used for the reward. This cannot be a valid reward signal, since it is just equivalent to a reward shaping term, which is policy invariant.
A5: We apologize for the confusion. We should emphasize that this is not a reward term. As we mentioned in lines 325-326, instead of using the return-maximizing objective from TD-MPC2, we adopt a soft-Q learning objective, since our learned Q function is a soft Q function. Specifically, $V^\pi(z_t) - \gamma V^\pi(z'_t)$ computes the implicit reward plus the policy entropy term. The detailed planning algorithm is provided in Algorithm 1.
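To make this explicit, under the soft value definition $V^\pi(z) = \mathbb{E}_{a\sim\pi}[Q(z,a) - \log\pi(a\mid z)]$ and writing $r$ for the implicit reward decoded via the inverse soft Bellman operator (stated here for a stochastic latent model $\hat{P}$; the deterministic case follows by collapsing the expectation), the value difference equals the expected implicit reward plus the policy entropy:

$$V^\pi(z) \;-\; \gamma\,\mathbb{E}_{a\sim\pi,\; z'\sim \hat{P}(\cdot\mid z,a)}\big[V^\pi(z')\big] \;=\; \mathbb{E}_{a\sim\pi(\cdot\mid z)}\big[\, r(z,a) \;-\; \log\pi(a\mid z)\,\big], \qquad r(z,a) \;:=\; Q(z,a) - \gamma\,\mathbb{E}_{z'\sim\hat{P}(\cdot\mid z,a)}\big[V^\pi(z')\big].$$

Maximizing this quantity during planning therefore corresponds to a soft (entropy-regularized) objective rather than to adding a shaping term to an existing reward.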
Other Concerns
Q6: On what task was the ablation study done, high-dim or low dim?
A6: We provide the details of our environments in Appendix D with dimensional information. In total, we conducted three ablation studies: one in the main manuscript and two in Appendix E.3.
For the ablation on the number of expert trajectories (Figure 6), we evaluated one low-dimensional setting (Hopper Hop) and one high-dimensional setting (Object Hold).
For the ablation on objective formulation (Figure 12), we used a high-dimensional setting (Humanoid Walk).
For the ablation on gradient penalty (Figure 13 and Table 6), we evaluated a low-dimensional manipulation setting (Pick Cube).
Q7: I believe there is an important work missing [1]. This work provides a bridge between SQIL and IQ-Learn, showing that the underlying problems are equivalent, a provide a solution to the instability also encountered during this paper.
A7: Thank you for your suggestion regarding the instability problem. We will include this paper in our related work section.
Q8: I believe many experiments use badly tuned baselines.
A8: We conducted experiments in various high-dimensional settings, including the relatively complex tasks of locomotion and dexterous hand manipulation. Previous methods struggle to handle these settings, leading to poor performance. In standard reinforcement learning settings (with reward signals from the environment), these tasks are also challenging for model-free agents such as SAC. The TD-MPC2 paper [2] compares model-free methods with world models on these tasks in reinforcement learning settings with reward signals from the environment. We also added an additional baseline, HyPE [1]. We present the new comparison results in Section 1 in the anonymous repository: https://anonymous.4open.science/r/rebuttal-reward-free.
Q9: Line 212: typo → Q twice
A9: It's not a typo. The second Q is written in calligraphy, which represents the Q space. This notation is defined in the preliminary section in line 126.
[1] Ren, J., Swamy, G., Wu, Z. S., Bagnell, J. A., & Choudhury, S. (2024). Hybrid inverse reinforcement learning. arXiv preprint arXiv:2402.08848.
[2] Hansen, N., Su, H., & Wang, X. (2023). Td-mpc2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828.
We appreciate the reviewer’s constructive feedback and have revised the manuscript accordingly. Below, we address the reviewer’s concerns:
1: Problematic Causality in writing
1.1: Misleading claim about inverse soft-Q learning in IRL.
A1.1: We have revised the relevant section of the introduction, specifically lines 65–66, as highlighted in blue in the updated manuscript.
1.2: Problem with the statement containing $p_0$.
A1.2: To address this issue, we revised lines 162–163 in the updated manuscript, with the changes highlighted in blue.
1.3: Lack of explanation for $\alpha$.
A1.3: We addressed this issue by revising lines 179–183 in the updated manuscript. Additionally, we conducted an ablation study on $\alpha$, with the results presented in Figure 14 of Appendix E.3.
1.4: Misleading statement in Section 4.
A1.4: We addressed this issue by revising the first paragraph of Section 4, with the changes highlighted in blue.
- Important work missing.
A2: We mentioned LS-IQ's new perspective on the regularizer in lines 182–184 in the revised manuscript.
We will continue to revise the manuscript and welcome any further feedback or suggestions the reviewer may have.
Dear Authors,
Thank you for addressing my concerns. Many of them have been resolved. However, I still have a few remaining points:
- Q5: In Eq. 15, the term $V^\pi(z_t) - \gamma V^\pi(z'_t)$ is used for the reward. This cannot be a valid reward signal, since it is just equivalent to a reward shaping term, which is policy invariant.
A5: We apologize for the confusion. We should emphasize that this is not a reward term. As we mentioned in lines 325-326, instead of using the return-maximizing objective from TD-MPC2, we adopt a soft-Q learning objective, since our learned Q function is a soft Q function. Specifically, $V^\pi(z_t) - \gamma V^\pi(z_t')$ evaluates to an estimate of the reward plus an entropy term, rather than the environment reward alone. The detailed planning algorithm is provided in Algorithm 1.
More clarification is needed here. The authors claim that $V^\pi(z_t) - \gamma V^\pi(z_t')$ is a valid objective because it is not estimating a reward, but a reward plus the entropy. However, I think it does not matter what it approximates: it remains a shaping term while being maximized like a reward. I hope the authors can further clarify this.
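For concreteness, the standard potential-based shaping argument underlying this concern is the following sketch, stated with a fixed potential $\Phi$ (Ng et al., 1999):

$$\sum_{t=0}^{H-1} \gamma^t \big( \Phi(s_t) - \gamma \Phi(s_{t+1}) \big) = \Phi(s_0) - \gamma^H \Phi(s_H),$$

so for a fixed potential the accumulated shaping term telescopes to a quantity that depends only on the start and end states, not on the intermediate actions, which is why maximizing such a term is normally considered policy-invariant.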
Q1: Problematic writing.
Q1.1: “By utilizing an inverse soft-Q learning objective for the critic network (Garg et al., 2021), our method derives rewards directly from Q-values and the policy, effectively rendering the world model reward-free. This novel formulation mitigates key challenges in IL, including out-of-distribution errors and bias accumulation.” → Inverse Q-learning is not the novel formulation that mitigates these problems; these are general advantages of inverse RL.
A1.1: We acknowledge that this statement is not entirely accurate and will revise the manuscript accordingly.
Q1.2: “In the objective function, p0 is the initial distribution of states. In this way, we can perform imitation learning by leveraging actor-critic architecture.” → It is unclear why the initial state distribution should allow leveraging actor-critic architectures. Also, p0 is not mentioned anywhere in the equations.
I would just remove the newly added lines in the revisions; there is neither a need nor a benefit to adding them.
Finally, please adapt the plots and make them more readable. Both the axis labels and the legends are too small to read.
We sincerely thank the reviewer for the thoughtful feedback. We are pleased to hear that many of the previous concerns have been resolved. In the response, the reviewer raised two points that require further revision or clarification:
- Further clarification on the term $V^\pi(z_t) - \gamma V^\pi(z_t')$.
- Additional concerns regarding writing and figure labels.
We provide our detailed responses below:
A1: Regarding the term $V^\pi(z_t) - \gamma V^\pi(z_t')$, the reviewer suggested that it represents a reward shaping term that is policy-invariant. However, we respectfully hold a different view and argue that this term is not policy-invariant.
As noted in our previous response, the term $V^\pi(z_t) - \gamma V^\pi(z_t')$ evaluates to an estimate of the reward plus an entropy term. In Eq. (15) of our manuscript, this expression can be understood as the expectation, over action sequences sampled from the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, of the term $V^\pi(z_t) - \gamma V^\pi(z_t')$.
Thus, the MPPI planning process effectively optimizes

$$\mathbb{E}_{a \sim \mathcal{N}(\mu, \sigma^2)}\big[ V^\pi(z_t) - \gamma V^\pi(z_t') \big]$$

over multiple steps, where the mean $\mu$ and standard deviation $\sigma$ are parameters optimized during the MPPI planning process. Additionally, in the term $V^\pi(z_t) - \gamma V^\pi(z_t')$, the next latent state $z_t'$ is unrolled using the action sequence sampled from $\mathcal{N}(\mu, \sigma^2)$ and the learned model, as described in lines 360–361 of the revised manuscript.
Consequently, the expression $\mathbb{E}_{a \sim \mathcal{N}(\mu, \sigma^2)}\big[ V^\pi(z_t) - \gamma V^\pi(z_t') \big]$ depends on the action sequences sampled from $\mathcal{N}(\mu, \sigma^2)$. The Gaussian distribution can be interpreted as a new "policy" optimized during the planning process. Therefore, the term $V^\pi(z_t) - \gamma V^\pi(z_t')$ depends on this newly optimized "policy," making it neither policy-invariant nor a purely reward-shaping term. While it shares a similar mathematical form with a potential-based reward shaping term, its underlying meaning is fundamentally different.
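To make this dependence concrete, below is a minimal sketch of MPPI-style planning in latent space. This is not the authors' Algorithm 1: the functions `latent_dynamics` and `soft_value`, the tensor shapes, and all hyperparameter values are hypothetical placeholders, and the score simply accumulates the discounted $V^\pi(z_t) - \gamma V^\pi(z_t')$ term discussed above.

```python
import torch

def mppi_plan(z0, latent_dynamics, soft_value, action_dim,
              horizon=5, n_samples=512, n_iters=6, n_elites=64,
              gamma=0.99, temperature=0.5):
    """Plan at latent state z0 (shape: (latent_dim,)) by optimizing a Gaussian
    over action sequences with MPPI-style weighted updates.
    - latent_dynamics(z, a): batched next latent states, shape (batch, latent_dim)
    - soft_value(z): batched soft value estimates, shape (batch,)
    """
    mu = torch.zeros(horizon, action_dim)   # mean of the planning "policy"
    std = torch.ones(horizon, action_dim)   # std of the planning "policy"

    for _ in range(n_iters):
        # Sample candidate action sequences from N(mu, std).
        actions = mu + std * torch.randn(n_samples, horizon, action_dim)

        # Unroll latent states with the learned model and score each sequence
        # by accumulating gamma^t * (V(z_t) - gamma * V(z_{t+1})).
        z = z0.expand(n_samples, -1)
        scores = torch.zeros(n_samples)
        discount = 1.0
        for t in range(horizon):
            z_next = latent_dynamics(z, actions[:, t])
            scores = scores + discount * (soft_value(z) - gamma * soft_value(z_next))
            discount *= gamma
            z = z_next

        # Re-fit mu/std to the highest-scoring (elite) sequences, weighted by
        # exponentiated scores, so the sampling distribution is shaped by the
        # very quantity being maximized.
        elite_idx = scores.topk(n_elites).indices
        elite_actions = actions[elite_idx]            # (n_elites, horizon, action_dim)
        weights = torch.softmax(scores[elite_idx] / temperature, dim=0)
        mu = (weights[:, None, None] * elite_actions).sum(dim=0)
        var = (weights[:, None, None] * (elite_actions - mu) ** 2).sum(dim=0)
        std = var.sqrt().clamp_min(1e-3)

    # Receding-horizon control: execute only the first action of the plan.
    return mu[0]
```

Because the weighted update pulls $\mu$ and $\sigma$ toward higher-scoring action sequences, the sampling distribution, i.e., the planning "policy," is itself determined by the term being maximized.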
If this explanation does not fully address your concerns, could you please clarify your definition of "policy invariance" and "reward-shaping term"? This would allow us to better understand and respond to your feedback.
A2: We appreciate the reviewer’s suggestions that the newly added lines should be removed and that the figure labels need to be larger. We fully agree and will implement these revisions. However, since the manuscript update deadline has passed, we will incorporate these improvements into the final version.
We thank the reviewer again for the valuable comments and are happy to provide further clarifications if needed. Please feel free to reach out with any additional questions or concerns.
Dear Reviewers,
We sincerely thank you for taking the time to review our paper and for providing thoughtful and constructive feedback. We have put considerable effort into addressing your concerns in our rebuttal, and we hope we have clarified all the issues raised.
As the discussion phase deadline approaches, please don’t hesitate to reach out if further clarification or additional experiments are needed.
Thank you once again for your time and engagement!
Best regards,
Authors of Submission 8939
Dear Reviewers,
We sincerely thank you for your valuable feedback and constructive suggestions, which have significantly enhanced the quality of our work.
We highlight the key strengths recognized by the reviewers:
- Novelty (Reviewer D7Rz and eRye): The work introduces a novel reward-free formulation of world models, integrating an inverse soft-Q objective to address diverse imitation learning tasks.
- Solid Experimental Results (Reviewer D7Rz, eRye and yQJv): The approach is thoroughly evaluated across a variety of benchmarks, including DMControl, MyoSuite, and ManiSkill2, demonstrating stable, expert-level performance. Comprehensive ablation studies provide insightful analysis of the proposed method's components.
- Clarity and Structure (Reviewer D7Rz and yQJv): The paper is well-written, with a clear and structured presentation, making the approach and its underlying formulation easy to understand.
The reviewers' concerns primarily focus on the following three aspects:
- Mathematical concerns and additional theoretical guarantees. (Reviewer 4dPD and D7Rz)
- Additional experimental requirements, including comparisons with new baselines, ablation studies, robustness in noisy environments, reward correlation analysis, and computational overhead evaluation. (All the reviewers)
- Unclear writing and incomplete related work coverage. (Reviewer 4dPD, D7Rz and yQJv)
To address these concerns, we have conducted further analysis and revised the manuscript as summarized below:
- Mathematical Concerns and Theoretical Guarantees: We have clarified the mathematical concerns raised by Reviewer 4dPD and provided additional theoretical analysis of our method. The theoretical analysis is detailed in Section 4.2 and Appendix H.3 of the revised manuscript.
- Additional Experiments: We have added baseline comparisons for six locomotion tasks and three manipulation tasks, as shown in Figure 2, Figure 3, and Table 1. Furthermore, we conducted additional experiments addressing ablation studies, robustness in noisy environments, reward correlation, and computational overhead. These results are presented in Appendix E, Appendix F, and Appendix G of the revised manuscript.
- Writing and Related Works: We have revised the manuscript to improve clarity and incorporated related works suggested by the reviewers. These updates are reflected in Sections 2, 3, and 4 of the revised manuscript.
We are grateful for your thorough review process and have worked diligently to address the concerns raised. We hope these revisions and the accompanying detailed responses adequately address your feedback, further strengthening the manuscript. Thank you again for your insightful suggestions and thoughtful engagement with our work.
Best regards,
Authors of Submission 8939
Dear Reviewers,
We sincerely thank you for taking the time to review our paper and for providing thoughtful and constructive feedback. We have put considerable effort into addressing your concerns in our rebuttal, and we hope we have clarified all the issues raised.
As the discussion phase deadline (December 2nd) approaches, please don’t hesitate to reach out if further clarification is needed.
Thank you once again for your time and engagement!
Best regards,
Authors of Submission 8939
The paper proposes IQ-MPC, a model-based, reward-free imitation learning framework that combines inverse soft-Q learning (IQ-Learn) with model predictive control (MPC). Instead of explicitly learning reward functions, IQ-MPC operates in the Q-function space and employs latent dynamics models for planning. The framework is evaluated across diverse domains (e.g., DMControl, MyoSuite, ManiSkill2), showing improved sample efficiency and scalability for both state-based and visual imitation learning tasks. Ablation studies highlight the model's robustness to reduced expert trajectories.
Reasons to accept
- IQ-MPC does not rely on explicit reward functions, addressing key challenges of instability and reward bias in imitation learning.
- The approach performs well across state-based and vision-based tasks.
- Comprehensive experiments on various benchmarks show satisfactory performance and improved sample efficiency compared to baselines.
- The paper provides a clear explanation of the proposed method, training process, and inference procedures.
- Thoughtful ablations examine the impact of design choices and varying numbers of expert trajectories.
Reasons to reject
- The method heavily builds on existing work (IQ-Learn and TD-MPC) with limited new theoretical insights. The contribution is mainly in combining these frameworks.
- Broader comparisons with other model-based and model-free IRL approaches are missing, particularly to recent state-of-the-art hybrid methods.
- Computational overhead and interaction efficiency comparisons are underexplored.
- IQ-MPC requires significantly more expert demonstrations (100-500) compared to prior works like IQ-Learn, which can learn with as few as 5-10.
- Many related works were missing. (addressed by the rebuttal)
- The reviewers were concerned with the quality of writing. (mostly fixed by the revised paper)
- During the AC-reviewer discussion phase, multiple reviewers shared the concern that the proposed planning objective resembles a potential-based reward shaping term, and that optimizing such a term should not alter the optimal policy learned.
While this paper studies a promising research direction and presents an interesting approach, I believe its weaknesses outweigh its strengths. Consequently, I recommend rejecting the paper.
Additional Comments from Reviewer Discussion
During the rebuttal period, all four reviewers acknowledged the authors' rebuttal, and two reviewers adjusted their scores accordingly.
Reject