There is a typo in the caption of Figure 2: "the model can show have multiple capabilities..."
The legend in Figure 7 is unclear. Could the authors link the curves with methods (a), (b), (c), and (d) in the text?
The motivation for the constraint in Eq. (2) is not adequately explained. The trajectory-level entropy constraint could lead the policy to sample out-of-distribution actions, which is problematic in offline RL and may harm performance. It would be better to compare and explain the results for a version without the trajectory-level entropy.
In the related work, some model-based offline planning methods have not been discussed, such as MBOP (Argenson and Dulac-Arnold, 2021), MOPP (Zhan et al., 2022), and GOPlan (Wang et al., 2024). Notably, MOPP and GOPlan consider uncertainty during planning and tend to remove/prune trajectories with high uncertainty.
It would be useful to report the computation time for MPC. Typically, test-time planning methods require more computation than actor-critic methods, such as TD3-BC and IQL. Could the authors also add test-time planning baselines to Table 1?
In the ablation study, the training details of the policy model and the world model should be provided. Q3 shows that a unified model performs better than the separated models, but it would be better to further demonstrate that the proposed unified model outperforms those in other papers. For example, the well-known world model RSSM (Hafner et al., 2019) could be compared.

References

Argenson, A., & Dulac-Arnold, G. (2021). Model-Based Offline Planning. ICLR.

Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., & Davidson, J. (2019). Learning Latent Dynamics for Planning from Pixels. ICML.

Wang, M., Yang, R., Chen, X., Sun, H., Fang, M., & Montana, G. (2024). GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models. TMLR.

Zhan, X., Zhu, X., & Xu, H. (2022). Model-Based Offline Planning with Trajectory Pruning. IJCAI.