PaperHub
Average rating: 6.6 / 10
Poster · 5 reviewers (min 5, max 8, std 1.2)
Ratings: 8, 6, 6, 8, 5
Confidence: 4.2
Correctness: 3.6
Contribution: 2.8
Presentation: 3.2
ICLR 2025

Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining

OpenReview · PDF
Submitted: 2024-09-18 · Updated: 2025-03-24
TL;DR

We scale offline model-based RL through a jointly-optimized world-action model pretrained across multiple games, which achieves sample-efficient transfer to novel games.

Abstract

Keywords
reinforcement learning, offline reinforcement learning, world model

Reviews and Discussion

Review
Rating: 8

The paper focuses on the important research question: "Can an image-observation-based world model scale offline RL across multiple tasks while enhancing generalization to diverse unseen tasks?". The paper presents JOWA, which 1) jointly trains a Transformer model that includes the world model, policy, and critic using offline data, and 2) plans with beam search at inference time using the trained world model. Experimental results on Atari tasks show that JOWA trained with a 10% subsample of 15 tasks significantly outperforms previous multi-task offline RL approaches, not only on in-distribution tasks but also when fine-tuning with 5k expert demonstrations. Moreover, JOWA shows favorable scaling properties with respect to the number of parameters. Exhaustive ablation studies show the importance of individual components of JOWA.

Strengths

S1. JOWA significantly outperforms previous multi-task offline RL approaches.

  • The experimental results in Table 2 show that JOWA achieves state-of-the-art results in multi-task offline RL with 10% subsampled Atari data (Mean 0.456, IQM 0.789).
  • The experimental results in Table 3 show that JOWA significantly outperforms prior works in fine-tuning (Mean 0.647, IQM 0.647).

S2. JOWA shows favorable scaling properties.

  • Increasing the number of parameters leads to better performance.
  • Compared to other methods, the performance increase is significant.

S3. Ablation studies show useful insights into JOWA's performance.

  • Planning with the world model contributes largely to the performance.
  • Training the policy jointly with the world model is important for multi-task agents.

S4. Open-source code & checkpoints.

  • Those resources will be helpful for the community.

Weaknesses

W1. Evaluated only in Atari

  • For example, continuous tasks like DMControl are relatively simple but challenging.
  • It would be exciting to see if JOWA can also be applied to continuous tasks.

W2. Discrepancy between Theorem 1 and planning of JOWA

  • Although the planning empirically improves the performance, the theory seems to differ slightly from JOWA's planning. Specifically, Theorem 1 assumes the critic is an estimate of the optimal Q-function, but JOWA uses a conservative estimate (CQL). Thus, Q(s, a) - Q*(s, a) will not be bounded by the term mentioned in Theorem 1; Q(s, a) will be much smaller than Q*(s, a) in reality.
  • Please correct me on this point if I'm wrong.

Questions

Q1. Can Theorem 1 be applied for planning of JOWA?

  • Theorem 1 shows that the error caused by model-based planning can be bounded. However, it assumes that the critic Q(s, a) is an estimate of the optimal Q-function (i.e., $\mathbb{E}[\sum_t \gamma^t r_t]$). However, the critic is learned via CQL, which tends to underestimate the Q-values, as discussed in [1]. Thus, Q(s, a) - Q*(s, a) will not be bounded by the term mentioned in Theorem 1; Q(s, a) will be much smaller than $Q^{*}(s, a) = \mathbb{E}[\sum_t \gamma^t r_t]$ in reality.
  • Can you share your thoughts on this? Please correct me on this point if I'm wrong.

Q2. Scaling property in fine-tuning tasks

  • Can you provide the fine-tuning results for JOWA-40M, JOWA-70M?
  • I think it will be helpful for investigating scaling effects in fine-tuning.

Q3. Extending JOWA to continuous environments

  • It would be exciting if JOWA could also be extended to continuous tasks. For example, continuous tasks like DMControl are relatively simple but challenging.
  • However, I understand that the training process is time-consuming and would require a lot of effort from the authors. Thus, I'm not strongly suggesting (but would welcome) that the authors answer this question via additional experiments; even so, sharing your thoughts or ideas on how to extend this work to continuous environments would be helpful for potential users of JOWA and for future work.

[1] Mitsuhiko Nakamoto et al., Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning, NeurIPS 2023.

Comment

[Q3] Extending JOWA to continuous environments.

[A5] We appreciate your interest in extending JOWA to continuous control tasks. We believe that extending JOWA to continuous control tasks is an interesting and promising direction for future work. We have already started arranging computing resources for this project. Our ultimate goal is to use TD learning to obtain a generalist/multi-task policy (on common benchmarks, such as Atari, ProcGen, DMLab, DMControl, and Meta-World) from offline datasets while also focusing on efficient adaptation to OOD tasks.

We sincerely apologize that due to the time limit of the rebuttal period, we have allocated our resources to other experiments, making it impossible to demonstrate JOWA's performance on continuous tasks at present. However, we can outline 3 potential approaches for extending JOWA to continuous environments:

  1. C51-style approach: We divide the value range of each action dimension evenly into K intervals and use the mean of each interval to represent it. We then form the discrete action space by taking all combinations of the per-dimension discrete values and treat the problem as a discrete control task, using a C51-style method to train an action-level optimal Q-function (see the sketch after this list). While this approach requires minimal code changes, we have concerns about the performance impact of discretization and the exponential growth of the action space in high-dimensional environments.
  2. Q-Transformer-style approach: After discretizing each action dimension, we learn a dimension-level (rather than action-level) optimal Q-function like Q-Transformer [1]. This approach requires moderate code modifications. However, we are concerned that the additional Bellman updates across action dimensions might exacerbate the credit assignment problem in high-dimensional action tasks.
  3. REINFORCE-style approach: Following action-dimension discretization, we learn an action-level Q-function (not the optimal Q-function) whose Q-head takes the last action-dimension token's embedding as input. We calculate the target Q-value using the Monte Carlo method with the corresponding offline trajectory and treat Q-head training as a multi-class classification problem using a cross-entropy loss (with a CQL penalty) for stability. Then a dimension-level policy head uses each embedding, from the last observation token to the penultimate action-dimension token, to predict the next action dimension, and is trained via REINFORCE rather than auto-regression. This framework (action-level Q-function with dimension-level policy) is somewhat similar to ArCHer [2], but we use a shared backbone for the Q-network and policy network and also incorporate the world model for planning. This method requires substantial code changes and potential re-experimentation on Atari, but in our experience it may be the most stable way to train.
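For concreteness, here is a minimal sketch of the per-dimension discretization in approach 1 (our illustrative code, not part of JOWA's released implementation; the function name and interface are assumptions):

```python
import itertools
import numpy as np

def make_discrete_action_set(low, high, bins_per_dim):
    """Uniformly split each continuous action dimension into `bins_per_dim`
    intervals, represent each interval by its midpoint, and take all
    combinations across dimensions to form a flat discrete action space."""
    low = np.asarray(low, dtype=np.float32)
    high = np.asarray(high, dtype=np.float32)
    edges = [np.linspace(l, h, bins_per_dim + 1) for l, h in zip(low, high)]
    midpoints = [(e[:-1] + e[1:]) / 2.0 for e in edges]  # K midpoints per dimension
    # The joint action space has K ** action_dim entries, which is the
    # exponential-growth concern mentioned above for high-dimensional actions.
    return np.array(list(itertools.product(*midpoints)), dtype=np.float32)

# Example: a 3-D action space in [-1, 1]^3 with K = 5 bins -> 125 discrete actions.
actions = make_discrete_action_set([-1, -1, -1], [1, 1, 1], bins_per_dim=5)
print(actions.shape)  # (125, 3)
```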

Currently, we have begun preparing computing resources for experiments on small-scale multi-task continuous control. We plan to start with the second method as we progress toward our ultimate goal.

[1] Chebotar, Yevgen, et al. "Q-transformer: Scalable offline reinforcement learning via autoregressive q-functions." Conference on Robot Learning. PMLR, 2023.

[2] Zhou, Yifei, et al. "Archer: Training language model agents via hierarchical multi-turn rl." arXiv preprint arXiv:2402.19446 (2024).

Comment

Thank you to the authors for the detailed response. Regarding the responses to my main questions:

A1, A5. I understand that it is challenging to do additional experiments for continuous environments. I believe that the potential approaches shared by the authors will be helpful for future work building on JOWA.

A2, A3. Thank you for addressing my concern. Now I clearly see that Theorem 1 actually shines when the Q estimation is inaccurate, since the model prediction will be much more accurate.

A4. Thank you for providing the experimental result. The result clearly shows the scaling property with IQM and Mean scores on fine-tuning. I think adding this result to the paper will be helpful for showing the scaling property in fine-tuning.

Overall, I believe this paper provides a meaningful contribution to the offline RL community. I will maintain my score of 8 but have increased my confidence to 5 that it should be accepted.

Comment

Thank you for recognizing our work and for your time and effort as a reviewer! We will add experiments on the scaling property of the fine-tuning task in Sections 5.4 and 5.5 of our main paper.

Comment

Dear Reviewer XU9g,

We sincerely appreciate your valuable and insightful comments as they are extremely helpful for improving our manuscript. We have addressed each comment in detail in the paragraphs below.


[w1] Evaluated only in Atari

[A1] We appreciate your interest in extending JOWA to continuous control tasks. We believe that extending JOWA to continuous control tasks is an interesting and promising direction for future work. We have already started arranging computing resources for this project. Our ultimate goal is to use TD learning to obtain a generalist/multi-task policy (on common benchmarks, such as Atari, ProcGen, DMLab, DMControl, and Meta-World) from offline datasets while also focusing on efficient adaptation to OOD tasks.


[w2] Discrepancy between Theorem 1 and planning of JOWA

[A2] Thank you for bringing this to our attention. We give the detailed answer in [A3].


[Q1] Can Theorem 1 be applied for planning of JOWA?

[A3] Thank you for bringing this to our attention. The problem you mentioned about CQL causing Q-value underestimation does exist. However, after carefully reviewing the proof of Theorem 1 in Appendix C, we found that the theorem still holds even with this underestimation issue, provided that the error between the CQL-optimized Q-function and the ground-truth optimal Q-function Q* is bounded by ε_Q.

This bounded-error assumption necessarily holds for JOWA because we restrict the Q-value to the range [Q_min, Q_max] as required by the C51 algorithm. Therefore, by setting ε_Q = Q_max - Q_min, the assumption holds, and the theorem follows.
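A one-line version of this argument (our paraphrase, assuming both the learned and the optimal Q-values lie within the C51 support range):

```latex
Q_\theta(s,a),\, Q^*(s,a) \in [Q_{\min}, Q_{\max}]
\;\Longrightarrow\;
\bigl| Q_\theta(s,a) - Q^*(s,a) \bigr| \;\le\; Q_{\max} - Q_{\min} \;=:\; \epsilon_Q .
```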


[Q2] Scaling property in fine-tuning tasks

[A4] Thank you for highlighting this. We fine-tuned JOWA-40M and JOWA-70M using the same 5k transitions and present the results in the following table. The results show that the Mean and IQM DQN-normalized scores scale with model size, while the Median DNS does not.

Table 1: Fine-tuning results for JOWA.

| Game | JOWA-150M | JOWA-70M | JOWA-40M |
| --- | --- | --- | --- |
| Gravitar | 273.3 | 387.4 | 326.5 |
| MsPacman | 2016.7 | 908.0 | 1296.3 |
| Pong | 17.7 | 14.6 | 14.2 |
| Robotank | 25.0 | 13.8 | 5.4 |
| YarsRevenge | 17506.2 | 16339.2 | 14072.1 |
| Mean DNS | 0.647 | 0.576 | 0.504 |
| Median DNS | 0.615 | 0.715 | 0.512 |
| IQM DNS | 0.647 | 0.603 | 0.498 |
Review
Rating: 6

The paper introduces JOWA (Jointly-Optimized World-Action model), a novel offline model-based reinforcement learning (RL) agent designed for generalization across multiple tasks, specifically focusing on scaling RL with image-based observations. JOWA is pretrained on multiple Atari games using a shared transformer backbone that jointly optimizes a world model and action model. This setup allows the model to learn both general-purpose representations and decision-making skills.

JOWA stabilizes temporal difference learning by incorporating world modeling as a regularizer, enabling large models to scale effectively. The proposed framework also features a parallelizable planning algorithm to improve Q-value estimation and policy search at inference time. Experiments show that JOWA outperforms state-of-the-art offline RL methods, achieving 78.9% human-level performance on pretrained games and demonstrating strong sample-efficient transfer to novel games with minimal fine-tuning.

Strengths

  • Extension from Sequential Modeling: JOWA effectively adapts transformer-based architectures from NLP to reinforcement learning, leveraging tokenization for image observations and combining offline RL with online planning to scale across tasks.
  • Strong Baseline Performance: It outperforms state-of-the-art methods, showing significant improvements on Atari benchmarks while requiring only 10% of the dataset.
  • Adaptability: JOWA demonstrates efficient transfer to new tasks with minimal fine-tuning, highlighting its robustness and generalization capabilities.

Weaknesses

  • Limited Originality: While JOWA effectively combines existing state-of-the-art techniques, it lacks substantial novelty in introducing new ideas. The core contributions largely build upon well-established methods, such as transformers, tokenization, and offline RL.
  • Reliance on Expert Data for New Games: While JOWA demonstrates strong performance in fine-tuning to new tasks, it heavily relies on expert-level data for unseen games. This dependence on high-quality data might limit its applicability in scenarios where such data is not readily available or costly to obtain.
  • Excessive Discretization: The paper discretizes the reward into the set {-1, 0, 1}, which may lead to a loss of important nuance and granularity in reward signals. This overly simplified reward structure might not capture the full complexity of more sophisticated environments, limiting the model’s performance in tasks requiring finer reward distinctions.

Questions

  • Why was the sequence length relatively low (8)?
  • Could JOWA learn a useful policy on new games using non-expert transitions?
  • It seems the numbers for MGDT are very different from the original paper (MGDT achieved 93% IQM HNS with a 40M model). Why is this?
  • Is the action-part module (CQL) hard to tune in this method?
Comment

Dear Reviewer pLGR,

We sincerely appreciate your valuable and insightful comments as they are extremely helpful for improving our manuscript. We have addressed each comment in detail in the paragraphs below.


[w1] Limited Originality.

[A1] Thank you for pointing this out. We think that the novelty of JOWA can be examined in the following ways:

  1. JOWA examines which existing technologies are crucial for scaling RL, providing a novel perspective on evaluating established techniques.
  2. JOWA introduces a novel planning algorithm. Our ablation experiments (Table 4) demonstrate that this planning algorithm significantly improves performance. As detailed in our response [A5] to reviewer zGRd, when compared to MCTS, our algorithm is not only 10× faster but also achieves better performance.

[w2] Reliance on Expert Data for New Games.

[A2] Thank you for this helpful suggestion. We have conducted additional fine-tuning experiments using non-expert transitions. See [A5] for details.


[w3] Excessive Discretization of rewards.

[A3] Thank you for highlighting this. We want to clarify that using the sign function to clip rewards is a common trick in the Atari domain [1-4] and is adopted by Stable Baselines3's Atari wrapper [5]. This trick serves several important purposes:

  1. It helps unify the value range across different games, enabling us to use the same value network architecture for multiple games.
  2. It reduces the scale disparity between different games, making multi-task learning more stable.

While we acknowledge that this discretization might lose some reward granularity, empirical results from multiple studies have shown its effectiveness in practice, particularly in the Atari domain.

[1] Kaiser, Lukasz, et al. "Model-based reinforcement learning for atari." arXiv preprint arXiv:1903.00374 (2019).

[2] Micheli, Vincent, Eloi Alonso, and François Fleuret. "Transformers are Sample-Efficient World Models." The Eleventh International Conference on Learning Representations.

[3] Kumar, Aviral, et al. "Offline q-learning on diverse multi-task data both scales and generalizes." arXiv preprint arXiv:2211.15144 (2022).

[4] Lee, Kuang-Huei, et al. "Multi-game decision transformers." Advances in Neural Information Processing Systems 35 (2022): 27921-27936.

[5] Atari Wrappers — Stable Baselines3 0.11.1 documentation
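For illustration, the clipping described above is just a sign operation on the environment reward; a minimal gymnasium-style wrapper (our sketch, not the authors' code; Stable Baselines3's Atari wrappers include an equivalent) would be:

```python
import numpy as np
import gymnasium as gym

class ClipRewardWrapper(gym.RewardWrapper):
    """Clip every raw Atari reward to {-1, 0, +1} via the sign function."""

    def reward(self, reward):
        return float(np.sign(reward))

# Usage: env = ClipRewardWrapper(gym.make("ALE/Pong-v5"))
```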


[Q1] Why was the sequence length relatively low (8)?

[A4] Thank you for raising this question about the sequence length. We set the sequence length to 8 based on the following considerations:

  1. Computational Efficiency: Given the substantial size of the dataset (~3TB), we can't load it entirely into memory and must read transitions from disk whenever the dataloader requests them. The sequence length directly impacts how many observations we need to read at a time. A longer sequence length (e.g., 20) would make data loading twice as slow as the training step (forward + backward), creating a significant bottleneck in training speed.
  2. Minimum Required Context: Based on experience with DQN, we estimate that 4-step observations are sufficient for decision-making. Given our planning algorithm's horizon H=2, the minimum required sequence length is 6. We selected 8 as it is the smallest power of 2 exceeding this minimum requirement.
  3. Prior Research: We also referred, to some extent, to the experimental results on context length in [1] (Figure 6 in its appendix), although those experiments were conducted on Decision Transformer.

[1] Bhargava, Prajjwal, et al. "When should we prefer Decision Transformers for Offline Reinforcement Learning?." arXiv preprint arXiv:2305.14550 (2023).

Comment

[Q3] It seems the number for MGDT are very different from the original paper (MGDT achieved 93% IQM HNS on 40M model). Why is this?

[A6] Thank you for pointing this out. We carefully examined the results from the original MGDT and ours, and found that different game sets lead to substantially different IQM scores even with similar raw scores. For clarity, we present the raw scores of MGDT-40M and -200M from our experiments and from the original MGDT paper for our 12 pretrained games (note that the other 3 games - Berzerk, SpaceInvaders, and StarGunner - were not pretrained in the original MGDT). Specifically, the MGDT-40M raw scores are taken from the Scaled-QL paper, as the original MGDT paper did not report MGDT-40M raw scores, while the Scaled-QL paper (by the same research team) did.

The results for these 12 games show that:

  1. The IQM HNS of MGDT-200M from our experiments and the original paper are 0.502 and 0.584 respectively.
  2. The IQM HNS of MGDT-40M from our experiments and the original paper are 0.382 and 0.604 respectively.
  3. Interestingly, the original MGDT-40M achieves higher IQM HNS than the original MGDT-200M on these 12 games (0.604 vs. 0.584).

Given that the original MGDT was pretrained for 100M gradient steps while we pretrained all methods for only 1.75M steps, the IQM HNS difference for MGDT-200M is understandable. However, the surprisingly strong performance of the original MGDT-40M makes our reproduction results appear weaker.

Table 2: Raw scores of MGDT from ours and MGDT's paper.

| Game | Random | Human | Ours MGDT-40M | Original MGDT-40M | Ours MGDT-200M | Original MGDT-200M |
| --- | --- | --- | --- | --- | --- | --- |
| Assault | 222.4 | 742.0 | 1227.2 | 1772.2 | 1741.5 | 2385.9 |
| Atlantis | 12850 | 29028.1 | 26657.1 | 304931.2 | 2565750.0 | 3105342.3 |
| BeamRider | 363.9 | 16926.5 | 972.0 | 3225.5 | 6011.3 | 8560.5 |
| Carnival | 0 | 3800 | 3460.0 | 3786.9 | 2610.0 | 2213.8 |
| Centipede | 2090.9 | 12017 | 3024.0 | 2867.5 | 4604.0 | 2463.0 |
| ChopperCommand | 811.0 | 7387.8 | 2400.0 | 3337.5 | 3300.8 | 4268.8 |
| DemonAttack | 152.1 | 1971 | 1943.3 | 3629.4 | 6549.4 | 23768.4 |
| NameThisGame | 2292.3 | 8049 | 4691.4 | 7777.5 | 6610.5 | 9056.9 |
| Phoenix | 761.4 | 7242.6 | 3522.8 | 4744.4 | 5120.5 | 5295.6 |
| Seaquest | 68.4 | 42054.7 | 700.0 | 3112.5 | 2720.0 | 5173.8 |
| TimePilot | 3568 | 5229.2 | 4000.0 | 3487.5 | 3866.7 | 2743.8 |
| Zaxxon | 32.5 | 9173.3 | 125.0 | 4637.5 | 462.5 | 275.0 |
| IQM HNS | 0.000 | 1.000 | 0.382 | 0.604 | 0.502 | 0.584 |

[Q4] Is the action-part module (CQL) hard to tune in this method?

[A7] We appreciate your interest in the difficulty of tuning CQL. When combining the CQL term with the distributional TD loss, we found the tuning of the CQL coefficient α to be relatively straightforward. We tested α values of 0.1 and 0.05, and both performed well. However, we encountered significant challenges when tuning α for the combination of CQL with the MSE TD loss. As mentioned in our ablation study (line 489), we tested α values in {0.01, 0.05, 0.1}, but the agents consistently over-optimized the CQL term regardless of the α value. This led to extremely high Q-values for in-domain state-actions and low Q-values for OOD actions.

Comment

[Q2] Could JOWA learn useful policy on new games using non-expert transitions?

[A5] We appreciate your interest in fine-tuning JOWA with non-expert transitions. To address this question, we conducted additional fine-tuning experiments using 5k suboptimal and highly-suboptimal transitions. Specifically, the suboptimal transitions were uniformly sampled from the complete DQN-Replay dataset, and the highly-suboptimal transitions were uniformly sampled from the initial 20% of the DQN-Replay dataset.

The results shown in the following table demonstrate that the fine-tuning performance strongly correlates with data quality. The mean DQN-normalized scores for expert, suboptimal, and highly-suboptimal data are 0.647, 0.516, and 0.422 respectively.

Table 1: Fine-tuning experiments of JOWA-150M with various types of transitions.

| Game | Expert | Suboptimal | Highly-suboptimal |
| --- | --- | --- | --- |
| Gravitar | 273.3 | 317.8 | 296.0 |
| MsPacman | 2016.7 | 1005.5 | 1126.8 |
| Pong | 17.7 | 13.2 | 13.8 |
| Robotank | 25.0 | 14.6 | 8.5 |
| YarsRevenge | 17506.2 | 15085.0 | 9755.4 |
| Mean DNS | 0.647 | 0.516 | 0.422 |
| Median DNS | 0.615 | 0.483 | 0.410 |
| IQM DNS | 0.647 | 0.511 | 0.383 |
Comment

Dear Reviewer pLGR,

As the discussion deadline is approaching (<3 days), we are actively looking forward to your further feedback. Thanks for your effort and understanding!

Kindest regards,

Authors of ICLR Submission 1467

Comment

Dear Reviewer pLGR,

As the discussion deadline is approaching, we are actively looking forward to your valuable feedback and would be very grateful if you could take a moment to review our responses.

We sincerely appreciate your precious time and consideration!

Kindest regards,

Authors of ICLR Submission 1467

Comment

Dear Reviewer pLGR,

First of all, we would like to wish you a Happy Thanksgiving! We hope you are enjoying this special holiday with your loved ones.

As the discussion deadline is approaching, we are actively looking forward to your valuable feedback and would be very grateful if you could take a moment to review our responses when you have time after the holiday.

We sincerely appreciate your precious time and consideration!

Kindest regards,

Authors of ICLR Submission 1467

Comment

Dear Reviewer pLGR,

As we draw closer to the discussion deadline (<1 day), we deeply value your expertise and perspective on our work, and we are eagerly anticipating your thoughtful feedback.

Understanding that your schedule may be demanding, we would be immensely grateful if you could find a moment to share your insights before the upcoming deadline. Your contribution is crucial in helping us refine our research and enhance its impact within the academic community.

Warmest regards,

Authors of ICLR Submission 1467

Comment

Dear Reviewer pLGR,

As the rebuttal deadline is fast approaching (<2 hours), we kindly remind you that this is our final opportunity for discussion during the rebuttal period. We would greatly appreciate it if you could spare a moment to review our responses. Your feedback is invaluable to us, and we sincerely hope for your consideration before the deadline.

Thank you once again for your precious time and thoughtful evaluation.

Kindest regards,

Authors of ICLR Submission 1467

Review
Rating: 6

This paper introduces JOWA, a Jointly-Optimized World-Action model for offline RL. By optimizing a shared transformer backbone for world-action modeling and employing an efficient planning algorithm to improve policy search, JOWA achieves good performance on pretrained games with limited data and demonstrates superior generalization to new games with minimal fine-tuning trajectories.

Strengths

  1. This paper is clearly written and easy to follow.
  2. This paper presents sufficient experimental results to demonstrate the validity of its proposed method.

Weaknesses

  1. A large generalist TD-MPC2 agent is capable of performing a variety of tasks across multiple domains. I wonder if the proposed method is better than TD-MPC2 in the offline setup.
  2. Extending the experiments beyond Atari to more complex environments like Kitchen or Meta-World would offer stronger validation of the proposed method's effectiveness.

Questions

Please see the weaknesses section.

Comment

Dear Reviewer pKew,

We sincerely appreciate your valuable and insightful comments as they are extremely helpful for improving our manuscript. We have addressed each comment in detail in the paragraphs below.


[w1] A large generalist TD-MPC2 agent is capable of performing a variety of tasks across multiple domains. I wonder if the proposed method is better than TD-MPC2 in the offline setup.

[A1] Thank you for pointing this out. Due to the inherent characteristics of the MPPI algorithm, the TD-MPC series can't handle discrete control tasks, as explicitly stated in TD-MPC2's paper (see Section I of its appendix). Similarly, the current version of JOWA can't handle continuous control tasks because it is based on a C51-style architecture. Therefore, a direct comparison between these two methods is not currently feasible.


[w2] Extending the experiments beyond Atari to more complex environments like Kitchen or Meta-World would offer stronger validation of the proposed method's effectiveness.

[A2] We appreciate your interest in extending JOWA to continuous control tasks. We believe that extending JOWA to continuous control tasks is an interesting and promising direction for future work. We have already started arranging computing resources for this project. Our ultimate goal is to use TD learning to obtain a generalist/multi-task policy (on common benchmarks, such as Atari, ProcGen, DMLab, DMControl, and Meta-World) from offline datasets while also focusing on efficient adaptation to OOD tasks. At that time, we will of course compare with TD-MPC2 on continuous control tasks.

We sincerely apologize that due to the time limit of the rebuttal period, we have allocated our resources to other experiments, making it impossible to demonstrate JOWA's performance on continuous tasks at present. However, we can outline 3 potential approaches for extending JOWA to continuous environments:

  1. C51-style approach: We divide the value range of each action dimension evenly into K intervals and use the mean of each interval to represent it. We then form the discrete action space by taking all combinations of the per-dimension discrete values and treat the problem as a discrete control task, using a C51-style method to train an action-level optimal Q-function. While this approach requires minimal code changes, we have concerns about the performance impact of discretization and the exponential growth of the action space in high-dimensional environments.
  2. Q-Transformer-style approach: After discretizing each action dimension, we learn a dimension-level (rather than action-level) optimal Q-function like Q-Transformer [1]. This approach requires moderate code modifications. However, we are concerned that the additional Bellman updates across action dimensions might exacerbate the credit assignment problem in high-dimensional action tasks.
  3. REINFORCE-style approach: Following action-dimension discretization, we learn an action-level Q-function (not the optimal Q-function) whose Q-head takes the last action-dimension token's embedding as input. We calculate the target Q-value using the Monte Carlo method with the corresponding offline trajectory and treat Q-head training as a multi-class classification problem using a cross-entropy loss (with a CQL penalty) for stability. Then a dimension-level policy head uses each embedding, from the last observation token to the penultimate action-dimension token, to predict the next action dimension, and is trained via REINFORCE rather than auto-regression. This framework (action-level Q-function with dimension-level policy) is somewhat similar to ArCHer [2], but we use a shared backbone for the Q-network and policy network and also incorporate the world model for planning. This method requires substantial code changes and potential re-experimentation on Atari, but in our experience it may be the most stable way to train.

Currently, we have begun preparing computing resources for experiments on small-scale multi-task continuous control. We plan to start with the second method as we progress toward our ultimate goal.

[1] Chebotar, Yevgen, et al. "Q-transformer: Scalable offline reinforcement learning via autoregressive q-functions." Conference on Robot Learning. PMLR, 2023.

[2] Zhou, Yifei, et al. "Archer: Training language model agents via hierarchical multi-turn rl." arXiv preprint arXiv:2402.19446 (2024).

Comment

Dear Reviewer pKew,

As the discussion deadline is approaching (<3 days), we are actively looking forward to your further feedback. Thanks for your effort and understanding!

Kindest regards,

Authors of ICLR Submission 1467

Comment

I appreciate the response provided by the author. I increase my confidence by 1.

Comment

Thank you for recognizing our work and for your time and effort as a reviewer!

Review
Rating: 8

This paper introduces a model-based RL agent, JOWA, which employs a transformer architecture and jointly optimizes world dynamics and Q-values across different environments. To compensate for inaccurate Q-values, the learned world model enables JOWA to search out the optimal policy via planning during inference time. Empirically, JOWA demonstrates impressive performance in a low-data regime by training on 15 Atari games. Moreover, JOWA's performance scales up with model sizes.

Strengths

  • The proposed JOWA outperforms existing SOTA methods by a large margin. The performance scales up with model size.

  • The joint optimization of world and action models stabilizes large-scale multi-task offline RL training.

  • The ablation studies comprehensively study the key design choices of the proposed methods.

Weaknesses

  • The proposed method combines the best offline RL training techniques, leveraging the world modeling loss to stabilize Q-value learning. The empirical performance is impressive. However, the technical novelty is thus limited.

  • By taking a closer look at Table 2, we can see that the 150M variant does not consistently outperform the 40M and 70M variants on all tasks. For example, the 40M variant achieves the highest score on Centipede, while the 70M variant excels on NameThisGame, SpaceInvaders, and StarGunner. Can you explain the phenomenon a bit more? Is it because of the training instability? Can further increase the model size and training iterations solve the issue?

  • It would be insightful to examine the emergent behaviors that arise when scaling up model size. For example, the 150M JOWA achieves a significantly higher score than the 70M JOWA and the other baseline methods on Zaxxon. Does the 150M JOWA agent exhibit any distinct emergent behaviors that could explain this performance improvement?

  • This work misses the citations for [1, 2].

[1] Li et al., Multi-task batch reinforcement learning with metric learning. NeurIPS 2020.

[2] Li et al., Offline reinforcement learning with closed-form policy improvement operators. ICML 2023

Questions

How's the model performance when you further scale up the model size, e.g., to 1B?

Comment

Dear Reviewer 7Vxq,

We sincerely appreciate your valuable and insightful comments as they are extremely helpful for improving our manuscript. We have addressed each comment in detail in the paragraphs below.


[w1] The proposed method combines the best offline RL training techniques, leveraging the world modeling loss to stabilize Q-value learning. The empirical performance is impressive. However, the technical novelty is thus limited.

[A1] Thank you for pointing this out. We think that the novelty of JOWA can be examined in the following ways:

  1. JOWA examines which existing technologies are crucial for scaling RL, providing a novel perspective on evaluating established techniques.
  2. JOWA introduces a novel planning algorithm. Our ablation experiments (Table 4) demonstrate that this planning algorithm significantly improves performance. As detailed in our response [A5] to reviewer zGRd, when compared to MCTS, our algorithm is not only 10× faster but also achieves better performance.

[w2] By taking a closer look at Table 2, we can see that the 150M variant does not consistently outperform the 40M and 70M variants on all tasks. For example, the 40M variant achieves the highest score on Centipede, while the 70M variant excels on NameThisGame, SpaceInvaders, and StarGunner. Can you explain the phenomenon a bit more? Is it because of the training instability? Can further increase the model size and training iterations solve the issue?

[A2] Thank you for highlighting this. We checked the action sequences output by JOWA-40M on Centipede and found that the agent consistently executes the rightfire action after the first few steps, which is not a fully meaningful behavior but surprisingly dodges enemy attacks and obtains the highest score. We further evaluated the last 5 checkpoints of JOWA-70M and JOWA-150M and found the winner on NameThisGame, SpaceInvaders, and StarGunner to be irregular, i.e., no model wins each of these games consistently.

Moreover, upon careful examination of Table C.2 in the appendix of the Scaled-QL paper [1], we found that MGDT-200M also performs worse than MGDT-40M on 13 games (BankHeist, Carnival, Centipede, FishingDerby, Freeway, Gravitar, IceHockey, Jamesbond, KungFuMaster, TimePilot, VideoPinball, WizardOfWor, and Zaxxon). Therefore, we believe this inconsistency is an inherent problem of offline RL and, unfortunately, neither increasing the model size (such as scaling from 40M to 200M) nor extending training iterations (even though the original MGDT [2] was trained for 100M gradient steps) can solve this issue.

[1] Kumar, Aviral, et al. "Offline q-learning on diverse multi-task data both scales and generalizes." arXiv preprint arXiv:2211.15144 (2022).

[2] Lee, Kuang-Huei, et al. "Multi-game decision transformers." Advances in Neural Information Processing Systems 35 (2022): 27921-27936.


[w3] It would be insightful to examine the emergent behaviors that arise when scaling up model size. For example, the 150M JOWA achieves a significantly higher score than the 70M JOWA and the other baseline methods on Zaxxon. Does the 150M JOWA agent exhibit any distinct emergent behaviors that could explain this performance improvement?

[A3] Thank you for pointing this out. We examined the rewards of the Zaxxon game and found that this "emergent behavior" is actually attributable to the nonlinear reward function. The first reward in Zaxxon is often nearly 5000. JOWA-40M obtains this large reward in one of the 16 rollouts, while JOWA-150M obtains it in 7 rollouts. In other words, when we clip the rewards to the range [-1, 1], JOWA-70M and JOWA-150M achieve scores of 0.11 and 0.44 respectively, indicating no true 'emergence' phenomenon. We find that [1] draws a similar conclusion: nonlinear or discontinuous metrics can cause apparent emergent abilities.

[1] Schaeffer, Rylan, Brando Miranda, and Sanmi Koyejo. "Are emergent abilities of large language models a mirage?." Advances in Neural Information Processing Systems 36 (2024).


[w4] This work misses the citation for [1,2]

[A4] Thank you for your suggestion. We will add these two works to our related work section to improve our paper.


[Q1] How's the model performance when you further scale up the model size, e.g., to 1B?

[A5] We appreciate your interest in further scaling the model size. While this is an interesting and important question, due to our current computational constraints, we cannot complete such experiments during the rebuttal period. We estimate that training JOWA-1B would require 16 A100 GPUs running for 1-2 months. Therefore, we will leave this as future work.

Comment

Thank you for your detailed response. I increase my confidence by 1.

I suggest you highlight the efficiency of your planning algorithm in your revised manuscript.

I like your responses in A2 and A3, and I suggest you discuss the actual behavior of the agent in the Appendix. Moreover, please include the phenomenon of "inverse scaling", i.e., increasing model parameters leading to decreasing performance on specific games, in the Limitations section.

Lastly, I would like to remind the authors that they can update the manuscript during the rebuttal period to incorporate their modifications. Doing so could help address, for example, Reviewer zGRd's concerns more effectively.

Comment

Thank you for recognizing our work and for your time and effort as a reviewer! We will upload an updated manuscript within the next 24 hours to include new experiments, conclusions, related work, limitations, etc. added during the rebuttal period.

Review
Rating: 5

This paper presents JOWA, an offline model-based reinforcement learning method aimed at scaling and generalizing multi-task RL through a shared transformer-based architecture. JOWA is designed to stabilize temporal difference learning for large models by integrating world modeling with Q-value criticism, thus leveraging a shared transformer backbone for both tasks. The paper introduces a novel parallelizable planning algorithm to counter Q-value estimation errors, achieving more consistent policy identification during inference. The model is pre-trained on Atari games with minimal data and reportedly outperforms existing baselines, demonstrating some generalization capacity to new tasks.

Strengths

  1. Scalability: JOWA’s design showcases robust scaling potential, as performance improves with model size without the usual TD-learning instability issues.

  2. Detailed Ablation Studies: The authors conducted extensive ablations, examining the impact of core design elements such as task embeddings, training losses, and synthetic data usage.

Weaknesses

  1. Missing related work and explanation of the world-action modeling architecture. The work proposes to use a VQ-VAE for representation learning in Atari games, while there is a work named Forward-Inverse Cycle Consistency (FICC), which uses a VQ-VAE on the offline Atari dataset to learn representations and action embeddings. This pipeline seems very similar to FICC; however, the authors didn't mention the difference between JOWA and FICC. (Both use offline Atari datasets for model-based RL and learn a universal policy with VQ-VAE loss terms.)

Ye, W., Zhang, Y., Abbeel, P., & Gao, Y. (2022). Become a proficient player with limited data through watching pure videos. In The Eleventh International Conference on Learning Representations.

  2. Overclaiming of the generalization ability. The experiments are based on Atari games, using the offline Atari dataset for pretraining. Extending this approach to more complex control scenarios (e.g., robotics or high-dimensional continuous control tasks) remains unproven. I'm not sure about the effectiveness of the scaling results.

Questions

  1. How might JOWA perform in a single-task setting compared to standard model-based methods? Would the shared architecture yield efficiency or performance gains? If it is worse than training from scratch with model-based RL, why are Atari games good benchmarks for evaluating the scaling laws?

  2. The authors mention beam search, so why not use MCTS, which has proven effective in MuZero and EfficientZero? Also, the authors didn't compare with other model-based RL algorithms, such as Dreamer.

Comment

Dear Reviewer zGRd,

We sincerely appreciate your valuable and insightful comments as they are extremely helpful for improving our manuscript. We have addressed each comment in detail in the paragraphs below.


[w1] Missing related work and explanations of FICC.

[A1] Thank you for bringing this to our attention. We apologize for missing this excellent paper due to our limited knowledge. We have modified our paper to include FICC as related work and have employed it as a model-based baseline for comparison (see [A4]). We explain the differences between JOWA and FICC as follows:

  1. FICC pretrains the representation and dynamics networks using action-free videos and then primarily fine-tunes the reward, value, and policy networks on downstream tasks with an action adapter, while JOWA pretrains all networks (representation, dynamics, reward, and value networks) with action-aware trajectories. This is actually the main difference between FICC and JOWA: they represent two different pretraining objectives, i.e., pretraining with videos or with trajectory data, respectively. This main difference affects both in-domain performance and OOD fine-tuning performance (see [A4]).
  2. FICC pretrains the inverse dynamics model, forward dynamics model, and latent action codebooks in a VQ-VAE-style pipeline, while JOWA only employs a VQ-VAE as the image tokenizer and pretrains the dynamics model in an auto-regressive way.
  3. The original FICC obtains a multi-task dynamics model but a single-task policy: as shown in Section 5.3 of their paper, the authors pretrained the representation network and dynamics model on 6 environments while fine-tuning on each environment separately. In contrast, JOWA obtains a multi-task policy after pretraining. However, one could build action adapters for each environment independently and then aggregate them into a two-level dictionary for multi-task fine-tuning of FICC, so this is not a big difference.
  4. FICC only open-sources the pretraining code, with no fine-tuning code or model weights, while JOWA open-sources the pretraining, fine-tuning, and evaluation code as well as model weights. Due to the extensive training time, we believe that our fully open-source approach is necessary to provide the community with valuable resources.

In conclusion, we thank you again for bringing this excellent work to our attention. The differences and performance comparison between FICC and JOWA help us to improve our paper and gain deeper insights into the impact of different objectives on RL pretraining.


[w2] Overclaim of the generalization ability.

[A2] Thank you for highlighting this. However, we have carefully reviewed our claimed contribution on generalization in the abstract and introduction, and believe that we have clearly restricted our conclusions about generalization to the Atari game domain. For example, we state that "JOWA ... and can sample-efficiently transfer to novel games using only 5k offline fine-tuning data (approximately 4 trajectories) per game, demonstrating superior generalization." in the abstract and "Third, JOWA enables sample-efficient transfer to diverse unseen games with 64.7% DQN-normalized score using only 5k transitions per game, surpassing baselines by 34.7% on average." in the introduction. Following MGDT and Scaled-QL, our work focuses on Atari games and considers generalization to OOD Atari games. Therefore, we cautiously state our conclusions about generalization, limiting the claimed property to the Atari domain.

We believe that extending JOWA to continuous control tasks is an interesting and promising direction for future work. We have already started arranging computing resources for this project. Our ultimate goal is to use TD learning to obtain a generalist/multi-task policy (on common benchmarks, such as Atari, ProcGen, DMLab, DMControl, and Meta-World) from offline datasets while also focusing on efficient adaptation to OOD tasks.

Comment

[Part 2 of 2 for [A6]]

We implement FICC-L by replacing the residual blocks in the representation, dynamics, and LAG networks with ResNet-50-style residual blocks, resulting in FICC-85M. We use the same 10% down-sampled dataset to pretrain the representation and dynamics models for 0.5M steps, compute the action adapter, and primarily fine-tune the Q-function on all 15 games at once for 1.25M steps. To save time and ensure a fair comparison of pretraining objectives, we employ the same Q-function training method as JOWA when fine-tuning FICC and evaluate FICC-85M with beam-search planning, rather than fine-tuning with EfficientZero. We used 32 A100 GPUs to train FICC-85M for 5 days. Here, we show the results of FICC-85M on both the in-domain 15 games and the OOD 5 games in the following tables. These results show that:

  1. JOWA-70M and JOWA-150M exceed FICC-85M by 9.9% and 82.2% IQM HNS on the in-domain 15 Atari games, respectively.
  2. JOWA-70M and JOWA-150M exceed FICC-85M by 4.9% and 12.5% IQM DNS on the OOD 5 Atari games, respectively.

Therefore, we conclude that within the Atari domain, JOWA's objective is better than FICC's. We find that the literature [1] draws a similar conclusion, namely that pretraining with trajectory data is better than pretraining with video data in the Atari domain (see Figure 2 in [1], where ID and Near-OOD are both Atari games).

We will include all the experiments and discussions about FICC in our main paper. Thank you again for bringing this excellent work to us, which helps us to improve our paper and gain deeper insights into the impact of different objectives on RL pre-training.

[1] Kim, Donghu, et al. "Investigating Pre-Training Objectives for Generalization in Vision-Based Reinforcement Learning." arXiv preprint arXiv:2406.06037 (2024).

Table 4: FICC vs. JOWA on in-domain Atari games

| Game | Random | Human | FICC-85M | JOWA-70M | JOWA-150M |
| --- | --- | --- | --- | --- | --- |
| Assault | 222.4 | 742.0 | 925.9 | 1733.9 | 2302 |
| Atlantis | 12850 | 29028.1 | 86250 | 570862.5 | 2690387.5 |
| BeamRider | 363.9 | 16926.5 | 6822 | 2547.4 | 3498 |
| Berzerk | 123.7 | 2630.4 | 400 | 441.9 | 739 |
| Carnival | 0 | 3800 | 2820 | 4070 | 5316 |
| Centipede | 2090.9 | 12017 | 3742.2 | 4475.6 | 4677 |
| ChopperCommand | 811.0 | 7387.8 | 2835.6 | 2568.8 | 3812.5 |
| DemonAttack | 152.1 | 1971 | 5806.4 | 4584.4 | 3547.8 |
| NameThisGame | 2292.3 | 8049 | 6236 | 12706.9 | 11421 |
| Phoenix | 761.4 | 7242.6 | 3814.5 | 5065 | 5348 |
| Seaquest | 68.4 | 42054.7 | 1760 | 1490 | 2725 |
| SpaceInvaders | 148 | 1668.7 | 641.2 | 969.1 | 744.7 |
| StarGunner | 664.0 | 10250 | 4936.4 | 21231.3 | 18150 |
| TimePilot | 3568 | 5229.2 | 4166.7 | 3831.3 | 3669 |
| Zaxxon | 32.5 | 9173.3 | 312.5 | 225 | 2163 |
| IQM HNS | 0.000 | 1.000 | 0.433 | 0.476 | 0.789 |

Table 5: FICC vs. JOWA on OOD Atari games

| Game | Random | Human | FICC-85M | JOWA-70M | JOWA-150M |
| --- | --- | --- | --- | --- | --- |
| Gravitar | 173.0 | 473.0 | 342.5 | 387.4 | 273.3 |
| MsPacman | 307.3 | 3085.6 | 1252.2 | 908.0 | 2016.7 |
| Pong | -20.7 | 19.5 | 14.0 | 14.6 | 17.7 |
| Robotank | 2.2 | 63.9 | 11.5 | 13.8 | 25.0 |
| YarsRevenge | 3092.9 | 18089.9 | 15036.2 | 16339.2 | 17506.2 |
| Mean DNS | 0.000 | 1.000 | 0.543 | 0.576 | 0.647 |
| Median DNS | 0.000 | 1.000 | 0.565 | 0.715 | 0.615 |
| IQM DNS | 0.000 | 1.000 | 0.575 | 0.603 | 0.647 |
Comment

[Q3] Why not use MCTS?

[A5] We appreciate your interest in using MCTS as the planning algorithm. We actually conducted a preliminary experiment to evaluate the effectiveness of MCTS and found that MCTS is about 10x slower than our proposed planning algorithm due to the non-parallelizability of MCTS. Here, to answer this question more formally, we conducted a more comprehensive experiment on planning with MCTS in JOWA. We implemented MuZero-style MCTS in Python rather than C++ for a fair speed comparison. With the same expansion-state budget, our beam-search-style planning and MCTS achieve 1.26 FPS and 0.12 FPS, respectively.

For MCTS, we conduct a grid search on the following choices:

  1. Compute the V-value using Q.mean() or Q.max() (note that JOWA only pretrains an optimal Q-function, so no V-function or policy network is available).

  2. We employed an energy-based policy to compute action probabilities, i.e., π = softmax(Q/t), where t is the temperature (see the snippet after this list). We searched t in {0.01, 0.1, 0.5, 0.7, 0.8, 0.9, 1, 2, 3, 5, 10}.

  3. Use most visited or most valuable action as the optimal action, i.e., argmax(root.children.visit_count) or argmax(root.children.value).

  4. Search the max depth of the tree H in [1, 7].
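For reference, the energy-based prior used for MCTS expansion is just a temperature-scaled softmax over Q-values; a minimal sketch (our illustrative code, with assumed tensor shapes) is:

```python
import torch

def energy_based_policy(q_values: torch.Tensor, temperature: float = 0.9) -> torch.Tensor:
    """Turn Q-values of shape (..., num_actions) into action probabilities
    via pi = softmax(Q / t); lower t sharpens the distribution."""
    return torch.softmax(q_values / temperature, dim=-1)
```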

Finally, we selected the configuration using Q.max(), t = 0.9, and argmax(root.children.value) for all games, while searching H in {1, 2, 4, 6} for each game. Even though we believe we performed an adequate hyperparameter search, MCTS still performs worse than beam search, as shown in the following table.

The results show that beam search exceeds MCTS by 71.0% Mean HNS. Moreover, we find that MCTS is highly sensitive to the temperature t and the max depth H, and inappropriate hyperparameter values can even degenerate the policy into a near-random one. However, its long execution time makes it inconvenient to tune hyperparameters.

Table 3: MCTS vs. beam search on JOWA-150M

| Game | Random | Human | JOWA w/o planning | JOWA with MCTS | JOWA with beam search |
| --- | --- | --- | --- | --- | --- |
| BeamRider | 363.9 | 16926.5 | 864.5 | 1137.0 | 3498 |
| Berzerk | 123.7 | 2630.4 | 396.9 | 440.0 | 739 |
| Carnival | 0 | 3800 | 5560.0 | 3340.0 | 5316 |
| ChopperCommand | 811.0 | 7387.8 | 806.2 | 1850.0 | 3812.5 |
| Seaquest | 68.4 | 42054.7 | 267.5 | 760.0 | 2725 |
| TimePilot | 3568 | 5229.2 | 662.5 | 4100.0 | 3669 |
| Zaxxon | 32.5 | 9173.3 | 12.5 | 50.0 | 2163 |
| Mean HNS | 0.000 | 1.000 | -0.02 | 0.221 | 0.378 |
| IQM HNS | 0.000 | 1.000 | 0.03 | 0.134 | 0.237 |
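For context, a minimal sketch of a parallelizable beam-search-style planner over a learned world model (our illustrative pseudocode, not the authors' released implementation; `world_model.step` and `q_values` are assumed interfaces) might look like the following:

```python
import torch

@torch.no_grad()
def beam_search_plan(world_model, q_values, state, beam_width=2, horizon=2):
    """Expand all beams in a single batch each step (which is what makes this
    style of planning parallelizable, unlike sequential MCTS), keep the best
    `beam_width` candidates by predicted return plus value-to-go, and finally
    return the first action of the best beam."""
    beams = state.unsqueeze(0)                        # (1, ...) start from the current state
    returns = torch.zeros(1, device=state.device)     # accumulated predicted reward per beam
    first_actions = None                              # first action committed by each beam
    for _ in range(horizon):
        q = q_values(beams)                           # (B, num_actions)
        _, top_a = q.topk(beam_width, dim=-1)         # expand each beam with its best actions
        flat_a = top_a.reshape(-1)                    # (B * beam_width,)
        next_states, rewards = world_model.step(
            beams.repeat_interleave(beam_width, 0), flat_a)
        cand_returns = returns.repeat_interleave(beam_width, 0) + rewards
        cand_first = flat_a if first_actions is None else \
            first_actions.repeat_interleave(beam_width, 0)
        # Score each candidate by rollout reward so far plus a Q-based value-to-go.
        scores = cand_returns + q_values(next_states).max(dim=-1).values
        keep = scores.topk(min(beam_width, scores.numel())).indices
        beams, returns, first_actions = next_states[keep], cand_returns[keep], cand_first[keep]
    return first_actions[returns.argmax()].item()
```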

[Q4] JOWA didn't compare with other model-based RL algorithms, such as Dreamer.

[A6] Thank you for raising this point. We didn't compare JOWA with other online model-based RL baselines such as Dreamer for the following reasons:

  1. Dreamer was proposed in the online setting, while our work focuses on offline RL. Thus, for a fair comparison, we would at least need to add CQL regularization to Dreamer.
  2. Dreamer needs to train with imagined trajectories and uses a high update-to-data (UTD) ratio of 512, both of which are time-consuming and would take more than 6 months to pretrain on 15 games with 16 A100 GPUs. Thus, we would need to disable model-based data synthesis to make the training time acceptable, just like in JOWA's pretraining.
  3. Since we would disable this ability of the world model during pretraining, we would need to use the world model after pretraining, e.g., by planning with the world model.
  4. After the above 3 changes to Dreamer, the differences between the modified Dreamer and JOWA would not be significant. The main remaining difference is JOWA's shared backbone, which has been proven effective in our ablation experiments.

However, we believe that the FICC you mentioned is a worthwhile baseline for comparison. As we stated in [A1], FICC and JOWA represent two different pre-training objectives, so it is well worth discussing the impact of these two objectives on the in-domain and OOD Atari games.

[Part 1 of 2 for [A6], to be continued ...]

Comment

[Q1] JOWA vs. standard model-based methods in a single-task setting.

[A3] Thank you for pointing this out. We would highlight that our motivation is to build a multi-task policy on Atari from offline data. Therefore, we focus on validating the efficiency and performance gains of the shared architecture under the multi-task pretraining setting, and do not compare with common online model-based RL algorithms, such as IRIS, STORM, MuZero, EfficientZero, et al., on the Atari 100k benchmark, which is beyond the scope of this work. However, our fine-tuning experiments are in the single-task setting, i.e., we fine-tune pretrained models on each OOD game separately. Here, we implement 2 more offline model-based RL algorithms, MOReL [1] and COMBO [2], which are trained from scratch with 5k transitions, along with JOWA-150M trained from scratch, and compare them to the pretrained models. MOReL constructs a conservative MDP with ensembled dynamics models and penalizes the reward according to dynamics uncertainty. The COMBO implemented here detaches the Q-head from the transformer backbone, resulting in separate optimization of the Q-function and the dynamics model. We show the results in the following table. The results show that:

  1. The performance gains from JOWA's designs (such as the shared backbone, planning, etc.) have limited benefits when training from scratch in an offline single-task setting, compared with other offline model-based RL algorithms.

  2. Pretrained models still demonstrate superior fine-tuning benefits over methods trained from scratch.

  3. JOWA's designs mainly improve pretraining performance directly, which in turn facilitates fine-tuning performance and helps the pretrained JOWA-150M obtain the best fine-tuning performance in our experiments.

We would also highlight that a design choice being ordinary in the single-task setting but important in the multi-task setting does not indicate that the choice is useless. For example, in the single-task setting, mean-squared TD error and distributional TD error perform comparably both online [3] and offline [4,5]. However, some works [6,7] as well as our own observations find that the MSE TD error does not scale well in the multi-task setting and performs much worse than the distributional TD error (shown in our ablation study).

Table 1: Results of the single-task fine-tuning experiment.

| Metric | MTBC-120M | MGDT-200M | EDT-200M | SQL-80M | JOWA-150M | JOWA-150M (scratch) | MoReL-130M (scratch) | COMBO-150M (scratch) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mean | 0.164 | 0.422 | 0.430 | 0.360 | 0.647 | 0.196 | 0.207 | 0.168 |
| Median | 0.215 | 0.354 | 0.325 | 0.284 | 0.615 | 0.173 | 0.168 | 0.173 |
| IQM | 0.205 | 0.377 | 0.380 | 0.355 | 0.647 | 0.181 | 0.214 | 0.153 |

[1] Kidambi, Rahul, et al. "Morel: Model-based offline reinforcement learning." Advances in neural information processing systems 33 (2020): 21810-21823.

[2] Yu, Tianhe, et al. "Combo: Conservative offline model-based policy optimization." Advances in neural information processing systems 34 (2021): 28954-28967.

[3] Agarwal, Rishabh, et al. "Deep reinforcement learning at the edge of the statistical precipice." Advances in neural information processing systems 34 (2021): 29304-29320.

[4] Kumar, Aviral, et al. "Conservative q-learning for offline reinforcement learning." Advances in Neural Information Processing Systems 33 (2020): 1179-1191.

[5] Kumar, Aviral, et al. "Dr3: Value-based deep reinforcement learning requires explicit regularization." arXiv preprint arXiv:2112.04716 (2021).

[6] Kumar, Aviral, et al. "Offline q-learning on diverse multi-task data both scales and generalizes." arXiv preprint arXiv:2211.15144 (2022).

[7] Farebrother, Jesse, et al. "Stop regressing: Training value functions via classification for scalable deep rl." arXiv preprint arXiv:2403.03950 (2024).

Comment

[Q2] Why the Atari games are good benchmarks for evaluating the scaling laws?

[A4] Thank you for bringing this to our attention. We chose Atari as the benchmark for evaluating scaling laws for the following reasons:

  1. Inspired by generative world models like SORA, our motivation is to build a large image observation-based world model from the offline dataset. Therefore, we focus on offline vision tasks.
  2. We investigate the offline RL datasets shown in the following table. After comprehensively considering the vision-task constraint and whether the data volume is sufficient, we finally choose Atari as the benchmark.
  3. We investigate the works about pretraining using Atari dataset and find MGDT, EDT and Scaled-QL, which provide sound and reproducible baselines.
  4. Atari is one of the most common and reliable benchmarks in the development of RL in the past decade. Atari has various types of games, each with different embodiments, dynamics, and reward functions, which makes Atari more challenging than other multi-task benchmarks such as Meta-World.
  5. The performance of multi-task policies on Atari still lags behind that of single-task policies, which means there is room for improvement for multi-task policies on Atari. For example, MGDT shows that MGDT-200M achieves 126% human-normalized score (HNS) on 40 Atari games, while single-task offline BCQ and single-task online DQN achieve 135% and 144% HNS.

Table 2: Offline RL datasets.

| Datasets | Vision/State | Continuous/Discrete | Data volume | Links |
| --- | --- | --- | --- | --- |
| D4RL | State | Continuous | about 1M transitions per task | repo |
| Atari | Vision | Discrete | 50M*5 per game, 60 games in total | repo / wrapper |
| v-d4rl | Vision | Continuous | about 100k*5 per task, 3 tasks in total | repo |
| Robomimic | Hybrid | Continuous | Lift Real (PH) 1.9G / Can Real (PH) 5.3G / Tool Hang Real (PH) 58G | website |
| Mimicgen | Hybrid | Continuous | 12 tasks, 48k demos | website |
| RoboTurk Pilot | Hybrid | Continuous | 1k demos | website |
Comment

Dear Reviewer zGRd,

As the discussion deadline is approaching (<3 days), we are actively looking forward to your further feedback. Thanks for your effort and understanding!

Kindest regards,

Authors of ICLR Submission 1467

Comment

Dear Reviewer zGRd,

As the discussion deadline is approaching, we are actively looking forward to your valuable feedback and would be very grateful if you could take a moment to review our responses.

We sincerely appreciate your precious time and consideration!

Kindest regards,

Authors of ICLR Submission 1467

Comment

Thank you for your clarification. I like your detailed results and I understand the difference between JOWA and FICC. However, I am not convinced by the experiments on Atari to evaluate the scaling laws. Like the Sora you've mentioned, it is trained toward real-world prediction. I guess you want to mention Genie or something like that. But they make no such claim.

Nevertheless, I have raised my score from 3 to 5.

Comment

Thank you for your valuable feedback and for raising your score. We truly appreciate your time and effort as a reviewer!

As researchers in RL, we would be incredibly grateful if you could share your insights on why Atari might not be suitable for verifying RL scaling laws, and what benchmarks you would consider more appropriate. While we fully understand this goes beyond typical review responsibilities, your expert perspective would be invaluable for guiding our subsequent series of work in this area.

We are genuinely eager to learn from your expertise and would deeply appreciate any thoughts you're willing to share, as they would help us better understand this field.

Comment

We thank reviewers for all the valuable feedback. We address all the reviewers' comments below and have incorporated all feedback in the revised manuscript using blue font. Specifically, we have updated the following:

  1. Add several relevant works in section 2 (Reviewer zGRd, Reviewer 7Vxq).
  2. Employ FICC as a model-based baseline for main experiments in section 5 (Reviewer zGRd).
  3. Add JOWA's scaling results and conclusions in the fine-tuning experiments, i.e., section 5.5 (Reviewer XU9g).
  4. Add a comparison with MCTS in section 5.6 to emphasize the efficiency of our planning algorithm (Reviewer zGRd).
  5. Add a description of the scaling inconsistency in the limitations (Reviewer 7Vxq).
  6. Add additional experiments in Appendix F, including details of MCTS (Reviewer zGRd), fine-tuning results using non-expert data (Reviewer pLGR), and explanations of the emergent behaviors (Reviewer 7Vxq).

We sincerely hope that our detailed rebuttal dispels any uncertainties or misunderstandings reviewers may have regarding our manuscript, thus contributing positively to the final ratings of this work. If any additional experiments are needed to further demonstrate the potential of JOWA, we will do our utmost to provide them during the valuable discussion period.

Comment

We are deeply grateful to all reviewers for their thorough evaluation of our work and their constructive feedback. We particularly appreciate their recognition of our paper's key strengths:

  1. Strong Baseline Performance: Reviewers 7Vxq, pLGR, and XU9g
  2. Scalability: Reviewers zGRd, 7Vxq, pLGR, and XU9g
  3. Adaptability: Reviewer pLGR
  4. Comprehensive Ablation Studies: Reviewers zGRd, 7Vxq, and pKew
  5. Open-Source Code: Reviewer XU9g

Through productive discussions during the rebuttal period, we have made substantial improvements to our paper to address the reviewers' concerns:

  1. Added additional relevant works (Reviewers zGRd, 7Vxq)
  2. Added a model-based baseline for main experiments (Reviewer zGRd)
  3. Added JOWA's scaling results and conclusions in fine-tuning experiments (Reviewer XU9g)
  4. Added a comparison between our planning algorithm and MCTS (Reviewer zGRd)
  5. Added a discussion of scaling inconsistency in the limitations section (Reviewer 7Vxq)
  6. Added fine-tuning results using non-expert data (Reviewer pLGR)
  7. Added the explanation of emergent behaviors (Reviewer 7Vxq)

After rebuttal, the main remaining concern from reviewers is that Atari might not be an ideal benchmark for evaluating RL's scaling law (Reviewer zGRd). We will carefully consider the reviewers' suggestions and design experiments based on their comments in our subsequent series of work, such as expanding to continuous action spaces.

We extend our sincere gratitude to the reviewers for their valuable insights, and to the Area Chair, Senior Area Chair, and Program Committee for their dedication in organizing this review process!

AC Meta-Review

This paper introduces a model-based RL agent, JOWA, which trains a single transformer architecture by jointly optimizing the world dynamics and Q-values across different environments. To compensate for inaccurate Q-values, the learned world model enables JOWA to search out the optimal policy via planning as well, and JOWA can be fine-tuned online and offline as well.

The reviewers generally liked the paper and the only concern was about training on Atari to plot scaling laws. There's prior work which uses multi-game Atari to evaluate scaling laws, and while I think this is not the end goal for the RL community (and indeed, this paper does only consider 15 Atari games), this work is a good starting point for scaling of model-based RL.

I am also wondering if comparisons to TD-MPC2 could be added to the mix as well (I understand it cannot be applied here directly, but with approximations with action and/or state discretization), since that's a model-based method as well which shows scaling curves, although on different domains. Nonetheless, the paper is of interest and value, and we are accepting this paper.

Additional Comments from Reviewer Discussion

See above.

Final Decision

Accept (Poster)