Scaling Laws for Pre-training Agents and World Models
Reviews and Discussion
The work presents a scaling law study examining the behavior of action and observation prediction models (behavior cloning and world models, respectively). The main results characterize trade-offs between model and dataset scaling given a fixed compute (FLOPS) budget. One result evaluates world model prediction performance based on the size of the discrete token vocabulary for observations. A second result evaluates behavior cloning performance based on using discrete or continuous observation embeddings. Additional analyses justify the choice of next-token loss for downstream task performance and use other domains (text and robotics) to examine potential causes for the different scaling behaviors.
The main findings are that:
- Increasing the discrete token vocabulary of a world model leads to better scaling in model size compared to data.
- Discrete observations in behavior cloning favor smaller models with more data compared to continuous observations. Scaling laws like in LLMs are less clear (under the experimental FLOP budgets) for discrete observations.
update after rebuttal
The replies addressed my questions, but my score was already quite positive. No changes.
Questions to the Authors
- [Q1] What is the conclusion to draw for image encoders being discrete vs continuous?
- The CNN (continuous) model has a higher loss and a shallower curve that asymptotes earlier. Is this connected to the size of the latent representation?
- The tokenized (discrete) model has not started plateauing in the same way.
- [Q2] Is the use of parametric fit distorting these results at the highest FLOPS due to models not saturating? Did the discretization have too many tokens?
- [Q3] How does inference speed scale with model size?
- This is not crucial, but if the data was already on hand it would strengthen the results.
- Specifically this seems important in the behavior cloning case, where the learned model would need to run inference at roughly 10 Hz (to keep up with a game).
- [Q4] What is the correlation coefficient for Figure 11?
- The association looks weak to the eye. And there is also a potentially small effect size given the very compressed scale.
- [Q5] Is there any analysis on the discretized vocabulary size of BC that would be comparable to the world model tokenization size experiments?
- Given the lack of saturation and the compression ratio results, it would be helpful to have a similar parallel analysis for the BC case. This is a "nice to have" that would strengthen the overall narrative.
Claims and Evidence
The claims and their evidence:
- claim: World models show power law scaling.
- evidence: Figures 1, 5, 6 and the associated experiments training a world model on Bleeding Edge data.
- claim: Imitation learning models show power law scaling.
- evidence: Figures 1, 7, 8 and the associated experiments training behavior cloning on Bleeding Edge data.
- claim: Tokenized imitation learning models favor smaller models with more data.
- evidence: Table 1 on the frontier fit analyses (and parametric fit for BC-Token-540). Also Figure 7.
- claim: Continuous observations for imitation learning models favor larger models.
- evidence: Table 1 on the frontier fit analyses (and parametric fit for BC-Token-540). Also Figure 8.
- claim: The trade-off between model and dataset size for world models is correlated with the number of tokens per observation.
- evidence: For the base experiments Table 1 and Figures 5 and 6.
- evidence: A test training world models on RT-1 data with varied tokenization sizes (Figure 11).
- claim: Next token prediction loss is a good proxy for reward achieved by behavior cloning.
- evidence: Figure 2, which re-analyzes a previous study on model scaling for behavior cloning. This shows correlation coefficients in the range of -0.94 to -0.98.
- claim: Next token prediction loss is a good proxy for image reconstruction quality achieved by a world model.
- evidence: Figures 3 and 7, which analyze image reconstruction quality (FVD and LPIPS) compared to world model size. These show correlation coefficients of 0.83 and 0.77.
- claim: The amount of token-level supervision and label super-classing explain the shallower scaling behavior of behavior cloning models compared to world models.
- evidence: Shakespeare text prediction scaling analyses that manipulate prediction tasks and class aggregation.
Methods and Evaluation Criteria
Yes. Scaling on game data in an "infinite data regime" is appropriate to measure scaling laws. The main body and supplement provide evidence that the loss in this regime is strongly correlated with reward for behavior cloning (correlation coefficients with magnitude greater than 0.9) and for observation prediction (aka world modeling, correlation coefficients of 0.83 for FVD and 0.77 for LPIPS). While the evidence is from different game environments, it provides reasonable grounds to extrapolate.
Ideally the games used would be more diverse (than maps from a single environment), but that is not strictly needed for this type of analysis. It may alter the scaling coefficients observed due to the heterogeneity in observations and actions, but it would be surprising to learn that other games violated these patterns in the main (and worth separate study).
The supplemental analyses of Shakespeare text and RT-1 observations add support to the core scaling claims, though I have some modest concerns about the methods and how well they would translate across model tasks (see below).
Theoretical Claims
No proofs are in the paper.
Experimental Design and Analysis
Re-analysis of behavior cloning scaling in prior games. No obvious issues, though I am not familiar with the original paper and not sure if the assumption of an infinite data regime matches.
The scaling analyses for behavior cloning and world modeling in Bleeding Edge. The questions below address specific points. Most pressing is how to interpret the BC comparison given the methodological fitting differences between the discrete and continuous cases. It is not clear what conclusions can be drawn given the differences in fitting methods and the lack of loss saturation within the FLOPs budget for the discrete case.
World model fitting extrapolation. No obvious issues, but it returns to the concern about drawing conclusions in the case where compute requirements prevented full testing.
Character prediction analyses of proposed mechanisms for different scaling behavior. These were a useful supplement, adding evidence in favor of the hypotheses around lack of supervision and token granularity to explain the WM vs BC differences. My only (minor) concern is methodological: the fits are all estimated using the parametric fit, despite the frontier fit being preferred for most of the other results in the paper. This matters because the BC-CNN scaling results seem sensitive to this choice (a minimal sketch contrasting the two fitting styles is given at the end of this section).
RT-1 observation encoding results to explain world model scaling behavior in model parameters compared to tokenization vocabulary size. These are helpful to have for more context, but raise questions about how strong the relationship is.
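For context, below is a minimal sketch of how I understand the two fitting styles referred to above. It is illustrative only and not the paper's procedure: the Chinchilla-style form L(N, D) = E + A/N^alpha + B/D^beta, the compute approximation C ≈ 6ND, the synthetic data, and the plain least-squares objective are my own assumptions, and the paper's frontier fit tracks compute-optimal model/data sizes, whereas for brevity I only fit the loss envelope here.

```python
# Illustrative sketch only (not the paper's procedure): contrast a "parametric fit"
# of the full loss surface with a "frontier fit" that only uses the best loss per
# compute bucket. The functional form and C ~= 6*N*D are assumed, not taken from the paper.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Synthetic training runs: model sizes N (parameters) and token counts D.
N = np.repeat(10.0 ** np.arange(6, 10), 25)      # 1e6 .. 1e9 parameters, 25 runs each
D = 10.0 ** rng.uniform(8, 11, size=N.size)      # tokens seen per run
C = 6.0 * N * D                                  # approximate training FLOPs

def loss_surface(X, E, A, B, alpha, beta):
    n, d = X
    return E + A * n ** (-alpha) + B * d ** (-beta)

# "Observed" losses drawn from a known ground truth plus a little noise.
L = loss_surface((N, D), 1.8, 400.0, 900.0, 0.34, 0.28)
L = L + rng.normal(0.0, 0.01, size=N.size)

# 1) Parametric fit: regress the full surface on every run jointly.
p_param, _ = curve_fit(loss_surface, (N, D), L,
                       p0=(1.0, 100.0, 100.0, 0.3, 0.3), maxfev=50000)
print("parametric fit (E, A, B, alpha, beta):", np.round(p_param, 3))

# 2) Frontier fit: bucket runs by compute, keep the lowest loss per bucket,
#    and fit a single power law to that compute-optimal envelope.
edges = np.logspace(np.log10(C.min()), np.log10(C.max()), 12)
bucket = np.digitize(C, edges)
C_f = np.array([C[bucket == b][np.argmin(L[bucket == b])] for b in np.unique(bucket)])
L_f = np.array([L[bucket == b].min() for b in np.unique(bucket)])

def envelope(c, E, k, gamma):
    return E + k * c ** (-gamma)

p_front, _ = curve_fit(envelope, C_f, L_f, p0=(1.0, 1e3, 0.2), maxfev=50000)
print("frontier fit (E, k, gamma):", np.round(p_front, 3))
```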
Supplementary Material
Yes. All of it.
Specifically useful parts:
- WM pretraining loss compared to outcome metrics (FVD and LPIPS).
- Dataset details to understand the diversity and nature of the data source.
- I skimmed information on training hyper-parameters, the character-level and RT-1 analyses.
Relation to Prior Literature
The prior literature on scaling laws and methodological improvements is connected to the choice of methods in this paper. The results in the paper are contrasted to prior work on auto-regressive modeling in video and images, where the newer methodology contradicts past findings. The paper also discusses prior scaling law research in the embodied context, including the work that is re-analyzed for behavior cloning.
Missing Essential References
No references were missing that I consider essential.
Other Strengths and Weaknesses
strengths
- clarity: The paper explains the methodological choices made and their application, which helps introduce readers less familiar with the technical aspects of the neural scaling law literature.
- rigor: The main claims are supported by convergent lines of evidence alongside the main results. This includes the correlation studies for BC and WM with downstream task performance and follow-up experiments around the scaling variation hypotheses in the Shakespeare and RT-1 domains.
- significance: Establishing trade-offs in scaling architectures for embodied AI (at least in games) is important as these techniques see wider adoption. The methodological rigor and references here will help that as well.
weaknesses
- clarity: The conclusions about WM scaling are hard to understand. The meaning is obscured by jargon. There are many ideas packed in, but it's hard to follow the overall narrative.
- It would help to make some statement along the lines of "When data is limited, favor scaling X." "When model size is constrained, favor scaling Y." for each case of: the WM tokenizer vocabulary size, BC tokenizer method.
- Similarly, the BC scaling conclusions are difficult to understand as stated.
Other Comments or Suggestions
I've included any non-critical questions in this section.
- Figure 1 can be misleading around BC scaling. Multiple times I've misread the figure due to the magnitude differences of losses for the tokenized models (roughly 0.4 to 0.5) and continuous models (3.5 to 4.0). It may help to more explicitly call this out or somehow make it more apparent that the CNN losses are roughly an order of magnitude higher. This difference is important when considering the practical implications of the work for downstream tasks.
- More generally, it may help to provide some remarks on what this implies for embodied AI training.
- What are the implied best practices from these results?
- It seems that the conclusion is that training a discrete model with as much data as possible is best (modulo saturation concerns for a given compute budget).
- Under what conditions of data limitation (for a given compute budget) does it become beneficial to use a continuous model?
- Figures 2 through 10 are too tiny to read. It would help to make the key figures larger and move less crucial points to the appendix.
- Perhaps the figures justifying the correlation of infinite data loss to outcomes.
- "This matches the magnitude of decrease seen in Table 1 from 0.66 to 0.32, indicating that the proposed mechanisms explain our findings." (line 370)
- This was confusing as stated. After some looking I believe it's intending to reference C^a for BC-CNN under Frontier Fit and BC-Token-540 under Parametric Fit.
Many thanks for taking the time to review our paper in such detail, and raising several points we had not considered. We are pleased to see the value of our contributions has been recognized. Below, we respond to your main questions, followed by your ‘other comments’ and finally several points raised elsewhere in the review.
- Q1. Discrete vs continuous encoders for BC.
Thanks for this comment – we appreciate the reviewer's insight in trying to reconcile the two plots, which we agree is worth understanding in more depth. First, as discussed under Other comment 1, caution should be applied as the two BC losses are in fact subtly different, e.g. one should not say the BC-CNN loss is ~10x higher than BC-Token's. However, the fact that BC-CNN asymptotes more quickly means those models are approaching the natural entropy in the data (the irreducible loss term in Eq. 7) at modest compute budgets, while BC-Token does not show signs of this. One hypothesis is that BC-CNN models are therefore more compute efficient, though an alternative hypothesis is that differences in loss modelling cause this. We propose to investigate this further when training the new smaller BC-Token models requested by Reviewer TxDt, by computing the equivalent BC-CNN loss for the new models to better understand how these two losses align. Again, thanks for the suggestion.
- Q2. Parametric fit distorting results
We assume the reviewer is referring to the BC-Token experiment? We expect that, following Reviewer TxDt's recommendation to train a set of smaller models to encourage saturation, we will be able to provide coefficients for the frontier fit method, resolving this issue.
- Q3. Inference speed and model size
We can confirm all BC-Token and BC-CNN models considered in our experiments can be run at >10Hz on a V100 GPU. We do not currently have numbers to hand on how inference speed changes with model sizes.
- Q4. Figure 11 correlation
The computed correlation comes out as 0.61, which is considered a moderate-strength relationship. We will add this to the figure caption. The high variance arises because a set of models is trained for each point – one model in the family failing to converge properly strongly affects the coefficient. This occurred more often for models trained with more tokens per image. We believe that with further optimization of the training schedule these models could converge more consistently and the pattern would emerge even more strongly.
- Q5. Investigating effect of tokenization of actions on coefficients
This is a nice suggestion we had not thought of! Does the action tokenizer affect coefficients in similar ways to image tokenizers? Whilst most action dims are natively discretized (buttons), the natural lever to play with here would be the number of bins for the continuous joystick dims. But we could also consider grouping together various button combinations to require fewer tokens per action prediction. We feel this goes a little beyond the current scope of the paper but would be excited to see this investigated in follow-up work.
- Other comment 1. Figure 1 with differing loss scales
We apologize for not making this clearer. In fact, the loss magnitudes are not directly comparable between BC-Token-540 and BC-CNN. Each model optimizes a subtly different loss; BC-Token-540 predicts the action dimension-by-dimension, while BC-CNN produces predictions for each action dimension independently (Section 3.2). Our implementations also aggregate the losses in different ways. We will edit the figure caption to make this clearer.
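For concreteness, a sketch of the distinction (the notation below is chosen for this response and is not a quotation of Section 3.2): writing an action as D dimensions a_1, ..., a_D with observation context o,

```latex
% Sketch only; notation is ours for this response, not necessarily that of Section 3.2.
% BC-Token-540: autoregressive, dimension-by-dimension factorisation of the action
\mathcal{L}_{\text{Token}} = -\sum_{d=1}^{D} \log p_\theta\left(a_d \mid a_{<d},\, o\right)
% BC-CNN: each action dimension predicted independently given the observation
\mathcal{L}_{\text{CNN}} = -\sum_{d=1}^{D} \log p_\theta\left(a_d \mid o\right)
% In addition, our implementations normalise/aggregate these sums differently,
% so the raw magnitudes of the two losses are not directly comparable.
```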
- Other comment 2. Small figures
Apologies, we will expand figure sizes in the next version.
- Other comment 3. Line 370 unclear
Apologies for this, you were correct in your guess. We will clarify.
- Comment 1. Re-analysis of BC ... unsure about infinite data assumption
To confirm, Tuyls et al. used pre-trained policies so could generate unlimited data and remain in the infinite data regime.
- Comment 2. Character prediction experiments used parametric fit
We agree the frontier fit would be preferable. While we could have used the frontier fit for the Dense loss, this did not appear possible for the sparse, super-classed loss (Figure 10). We favored consistency in our fitting method across these experiments.
- Comment 3. Conclusions about scaling hard to understand
Thanks for this comment. We have focused on presenting nuanced details from the scaling analysis and have so far not offered clear conclusions for practitioners. Would the reviewer agree with the concrete recommendations below?
- When training BC-Token models, model sizes should be substantially smaller than for BC-CNN.
- For WM-Token models, increasing tokens-per-image of the tokenizer should be done in parallel with increasing model sizes.
- As per the response to Q1, we may be able to advise a preference for BC-CNN models, but await our analysis contrasting BC-CNN and BC-Token losses.
Thanks for the detailed replies! I'll only remark on open topics in my mind.
In general I do not share as large a concern about establishing the level of correlation in more complex environments. This will likely be lower than in toy environments (for smaller foundation models), but is unlikely to be so insubstantial as to render the models useless. While that may be true, it would merit a separate publication in its own right (and perhaps is a follow-on to consider with Bleeding Edge).
Comment 3
I'm most interested in the results from the third suggestion, as that seems the most interesting outcome of the experiments that have been done so far.
The other two claims are sensible and would be welcome to highlight.
Q1
I had not realized this difference. I would like to see the new experiment results and loss scaling to better understand the implications. As mentioned above, I found it difficult to translate the results into whether tokenization or CNN was better at these model scales. Understanding this kind of scaling behavior is very important with the growth of multi-modal foundation models in general (which often incorporate vision as an input modality). At least the matched scaling seems important to verify before acceptance.
Q5
The other idea that came to mind was to apply a different discretization scheme. The FAST tokenizer may be one option to consider for the discrete input spaces (https://arxiv.org/abs/2501.09747). Note that really it's a recipe for many such frequency domain tokenization approaches, leaving the choice of compression technique as a free parameter.
This paper investigates the scaling laws in embodied AI. Specifically, this paper focuses on the infinite data regime and generative pre-training objectives, which include behavior cloning and world modeling. The scaling laws are observed in the following two cases through the experiments. One is world modeling with tokenized observations and actions while the other is behavior cloning with one continuous encoding per observation. Experiments and analysis are conducted in the domains of video games and robotics.
Questions to the Authors
Please refer to the issues raised above.
Claims and Evidence
My primary concern is that this paper is built upon the assumption of an infinite data regime, where samples are not trained on more than once. While this assumption holds for LLM pre-training on NLP data, is it applicable to the cases used in this paper, such as specific video games or robots in a fixed environment?
Methods and Evaluation Criteria
This paper claims the pre-training loss could be used to represent the downstream performance and thus serve as the indicator for scaling laws. However, this connection is only established in small-scale environments like Atari and NetHack. There is no clear evidence or guarantee that this observation extends to more complex scenarios (e.g., video games and robotics). Hence, whether the observed scaling laws can represent real robot performance remains unclear.
Theoretical Claims
No theoretical claims are provided in the main paper.
Experimental Design and Analysis
This paper only uses the generative pre-training loss for evaluation. Evidence for how well this proxy reflects real performance is lacking.
Supplementary Material
I have read the appendix.
Relation to Prior Literature
Compared with previous works, this paper chooses to analyze the scaling laws of embodied AI by generative pre-training loss in the infinite data regime.
Missing Essential References
To the best of my knowledge, the references are sufficiently covered.
Other Strengths and Weaknesses
- The overall paper is not well organized. For example, all figures in this paper are too small.
- The subsections in this paper are also not well structured. For example, it would be better to move Sec. 2.2 into Section 3, since Sec. 2.2 can be viewed as the foundation of the method. Also, moving Sec. 3.3 to the experiments section would make it easier for the reader to follow. Overall, the paper flow and figure layout should be improved.
Other Comments or Suggestions
Please refer to the issues raised above.
Thank you for taking the time to review our paper. Please see below our responses.
- Comment 1. Infinite data regime assumption
Thank you for drawing attention to this important detail. To clarify, all experiments in our paper for both domains (video games and robotics) are conducted in the infinite data regime (Section A.3.1 computes the amount of FLOPs that would violate this assumption in our setup for the main experiments).
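As a back-of-the-envelope illustration of the kind of check in Section A.3.1 (the symbols and the C ≈ 6ND approximation below are chosen for this response, not quoted from the appendix):

```latex
% Illustrative single-epoch check; symbols are ours.
% N = model parameters, D_used = tokens consumed in training, D_total = tokens in the dataset.
% With the common approximation C \approx 6 N D_used, a run stays within one epoch as long as
D_{\text{used}} \le D_{\text{total}}
\quad\Longrightarrow\quad
C \;\le\; C_{\max} \approx 6\, N\, D_{\text{total}} .
% Any (N, C) combination exceeding this bound would require repeating data,
% leaving the infinite data regime.
```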
Investigation into scaling laws outside the infinite regime is an active research area in LLM research (Scaling Data-Constrained Language Models). As we understand, the current thinking is that scaling laws do still exist in this setting, but are influenced in adverse ways by repeated epochs.
We believe the reviewer is correct to call this out as a line of research very relevant to embodied AI, which typically falls within the data-constrained regime. However, we believe it falls outside the scope of a first investigation into scaling laws in embodied AI, which we have chosen to focus on the effect of tokenizer, task, and architecture.
- Comment 2. Pre-training loss and downstream performance
We have done our best to frame all claims in the paper as optimizing for loss in the pre-training phase of agents, rather than for downstream performance, as these are the concrete results we have. We have further provided conceptual arguments and experimental evidence on the link between pre-training loss and performance. Note the NetHack and Atari experiments, while of limited complexity, are not necessarily small-scale (up to 10^18 FLOPs).
We feel that providing further evidence for our dataset, environment, and setup would require resources beyond the scope of this paper. We would need to implement a form of post-training phase to maximize some reward signal, and to engineer an automated, distributed solution to enable rollouts in this game at large scale (something Bleeding Edge does not natively do).
In general, we agree with the reviewer that the link between pre-training loss and online performance is important enough to deserve further study, but leave this to a separate investigation.
- Other comments
Thank you for these suggestions. Should we be fortunate enough to have the paper accepted, we will take your suggestion and use the allowed extra page to enlarge figures for clarity. Regarding the paper organization, please confirm whether other reviewers support your proposed changes. We would then be happy to reorder the paper as you suggest in the next version.
Thanks to the authors for providing the rebuttal. I've read the author's response and comments from other reviewers. I have no further questions at this time. I will increase my original rating to 3.
This paper investigates the existence and characteristics of scaling laws in embodied AI tasks, specifically world modeling (WM) and behavior cloning (BC), drawing parallels to scaling laws observed in large language models (LLMs). Through extensive experiments on large-scale video game datasets (e.g., Bleeding Edge) and robotics data (RT-1), the authors demonstrate that power-law relationships between model size, dataset size, compute, and pre-training loss also apply to WM and BC. Key findings include:
- Scaling laws for WM are influenced by tokenizer compression rates. Higher compression rates lead to more reliance on dataset size.
- BC with tokenized observations requires prioritizing dataset scaling under modest compute budgets, while BC with CNN-based architectures favors model scaling.
- Small-scale language modeling experiments validate mechanisms behind observed scaling phenomena (e.g., sparse supervision and target granularity).
Questions to the Authors
N/A
Claims and Evidence
All claims are supported by thorough experiments that are carefully designed.
Methods and Evaluation Criteria
Yes
Theoretical Claims
N/A
Experimental Design and Analysis
All experiments are carefully designed and conducted.
Supplementary Material
Yes, Figure 16.
Relation to Prior Literature
Scaling laws play a crucial role in the development of large AI models, such as LLMs. This paper makes significant contributions to the understanding of scaling laws in agent learning by utilizing expressive architectures. These insights will be instrumental in unifying scaling laws across various domains.
Missing Essential References
To the best of my knowledge, all related works are cited/discussed.
Other Strengths and Weaknesses
Strengths:
- The work bridges a critical gap in understanding scaling laws for embodied AI, extending principles from LLMs to WM and BC. This is timely, given the growing interest in scaling generative models for robotics and interactive agents.
- The paper validates findings across diverse datasets (video games, robotics) and architectures (tokenized vs. CNN-based). The inclusion of small-scale language experiments to explain mechanisms adds methodological depth.
- The analysis of tokenizer compression rates and architecture choices provides actionable guidelines for optimizing compute allocation in real-world applications.
- I really like the meta-analysis linking pre-training loss to downstream metrics (e.g., FVD/LPIPS for WM), which strengthens the case for using pre-training loss as a proxy in scaling studies.
Weaknesses:
- The existing datasets for robotics lack sufficient diversity, which may limit the value of the conclusions presented in the paper, particularly when downstream tasks emphasize generalization capabilities [1]. The RT-1 robotics dataset's small scale also raises questions about extrapolation to large-scale robotics datasets like OXE.
- The observed effects of tokenizer compression rates and target granularity lack theoretical grounding. While empirical results are compelling, deeper mechanistic explanations would strengthen the work.
[1] A Taxonomy for Evaluating Generalist Robot Policies
Other Comments or Suggestions
N/A
Thank you for your review. We are delighted to have successfully communicated the value of our study. We agree that extending principles from LLMs to embodied AI tasks is a timely and important avenue of research. Allow us to respond to your comments below.
- Comment 1. RT-1 dataset has limited diversity and is small scale. How do the conclusions apply to tasks where generalization is important, and to larger-scale datasets?
We recognize that the RT-1 tasks, policies and visuals contain only limited diversity. On the other hand, the Bleeding Edge dataset contains a huge amount of diversity, with a pool of seven maps and tens of thousands of demonstrators of differing skill levels. While these two datasets are from different domains, we did not observe contradictions between the two sets of experiments. We hope that analyzing the results in combination gives researchers enough evidence that similar patterns would also emerge in a large, diverse robotics dataset.
- Comment 2. Effects of tokenizer compression rates and target granularity lack theoretical grounding
Thank you for this comment. This is one of the directions for future work we are also most excited about, and have started very initial experiments in this direction. We sketch our current thinking on this issue in case of interest.
We have begun considering the case of lossless compression, where more tokens per observation (lower compression) should make next-token prediction easier, both in the sense of the pattern being simpler (fewer parameters required) and having less stochastic noise (less data required). We suspect this leads to lower values of both alpha and beta. But the key question is which shrinks faster (since their ratio is what matters), which may be harder to reason about without further assumptions.
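To spell out why the ratio is what matters, consider the standard Chinchilla-style setup (a textbook derivation under the usual assumptions, not something specific to our experiments):

```latex
% Standard compute-optimal trade-off; not specific to our setup.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad C \approx 6 N D .
% Minimising L subject to the compute constraint gives
N^{*}(C) \propto C^{\,a}, \qquad D^{*}(C) \propto C^{\,b}, \qquad
a = \frac{\beta}{\alpha + \beta}, \quad b = \frac{\alpha}{\alpha + \beta}, \quad a + b = 1 .
% So even if a tokenizer change lowers both alpha and beta, the optimal model/data
% split only shifts if their ratio changes.
```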
One thing that makes the analysis tricky is that losses may not be consistent across tokenizers even in the lossless case, and, in reality, tokenizers become more lossy at higher compression rates.
Regarding this paper submission, we believe that a deeper analysis of compression rates is best left to a separate paper, and perhaps domains more straightforward than embodied AI. Should you be interested in exploring this direction further, please reach out following the review period.
The paper explores scaling laws in embodied AI, especially for the pre-training stage of world models and agent behavior. The authors show that power laws similar to those in LLMs are observed in world modeling and behavior cloning, but with coefficients influenced by the tokenizer, task, and architecture. The study provides insights into optimal model and dataset sizing for these tasks.
Questions to the Authors
- Additional evidence to support the claim that the pre-training loss correlates strongly with the final performance metrics would definitely be useful, especially because embodied AI with agent modelling suffers from error accumulation over steps, resulting in out-of-distribution states and sub-optimal performance in the long tail of scenarios.
- The authors claim that the scaling laws for BC with tokenized observations are hard to observe under the current compute budgets. Could additional evidence using smaller models or a different dataset be added to show that the scaling laws actually satisfy the authors’ claims?
- Figure 9 shows the extrapolation of the scaling laws to a much larger model - additional examples of large scale models for the other claims will also help strengthen that they are satisfied over a wide range of model scales. (I do understand that this could require significant training time, so will only be a positive if it can be added)
Claims and Evidence
- Scaling laws similar to those in LLMs can be observed in world modeling with tokenized observations and actions → Partially insufficient
- The authors show extensive experiments to prove the scaling laws for training loss, but only minimal examples of how this training loss translates to real world performance on embodied AI tasks. Additional tasks to show the training loss as a proxy would make this claim stronger.
- The optimal trade-off between model and dataset size in world modeling is influenced by the tokenizer’s compression rate → Sufficient
- Two examples of tokenizers are used to support this claim. Additional examples using the small RT-1 dataset also support this claim, but the sizes used are quite small due to the limited size of the dataset.
- Scaling laws for BC with tokenized observations are harder to observe under modest compute budgets. The optimal trade-off favors smaller models and more data → Partially insufficient
- The authors claim that models with size > 2M params don’t saturate over the flops range considered and use the parametric fit instead of the frontier fit method. Can additional models with smaller sizes <2M be used to show that the parametric fit and frontier fit curves match for smaller sizes?
- Scaling laws similar to those in LLMs can once again be observed in BC with one continuous encoding per observation → Sufficient
- The power law curves for training loss with BC-CNN show a strong trend as claimed.
Methods and Evaluation Criteria
The benchmark dataset from the game Bleeding Edge is a reasonable choice for the proposed claims, but might not be sufficient to show that the claims also hold for other examples of embodied AI - especially real-world tasks with a long tail of difficult scenarios. The authors show that the training loss has a strong correlation with the final task performance for the few tasks studied, but it's not clear whether this will also hold for other general cases of BC and world modeling (as discussed in [1]), as is even discussed for LLMs in [2].
Also, previous studies like [3,4] have shown a significant impact of dataset quality on downstream performance for agents. In addition, it would be very useful if the 7 Maps dataset were publicly available to allow reproducibility of this research.
[1] Codevilla, Felipe, Eder Santana, Antonio M. López, and Adrien Gaidon. "Exploring the limitations of behavior cloning for autonomous driving."
[2] Liu, Hong, Sang Michael Xie, Zhiyuan Li, and Tengyu Ma. "Same pre-training loss, better downstream: Implicit bias matters for language models."
[3] Bronstein, Eli, Sirish Srinivasan, Supratik Paul, Aman Sinha, Matthew O’Kelly, Payam Nikdel, and Shimon Whiteson. "Embedding synthetic off-policy experience for autonomous driving via zero-shot curricula."
[4] Belkhale, Suneel, Yuchen Cui, and Dorsa Sadigh. "Data quality in imitation learning."
Theoretical Claims
N/A; all the claims in the paper are empirical and supported by experimental results.
Experimental Design and Analysis
The experiments are well designed and detailed to support the claims for the proposed domain of agents and world modelling for the Bleeding Edge game. Additional evidence of final game performance of the agent and world model using the proposed pre-training setup to show the strong correlation between the training loss and the game performance will help solidify the claims further.
The extension of these claims, however, to the entire embodied AI domain for agents and world models isn't really conclusive - especially because the optimal dataset and model sizes vary quite a lot between the different tasks and tokenizers.
Supplementary Material
Yes, the appendix at the end of the submission is a useful addition - especially the training and other implementation details help with the understanding of the paper and the experiments conducted to obtain the results.
Relation to Prior Literature
The contributions of the paper, especially showing that the scaling laws from LLMs extend to embodied AI, are significant, and will accelerate the development of models for the community by providing suggestions for optimal model and dataset sizes. The claims, however, feel stronger than the actual results, especially since the scaling laws aren't really general and are influenced by the specific task, tokenizer, and architecture.
Missing Essential References
None that are obvious or very well known.
Other Strengths and Weaknesses
Strengths
- The authors successfully demonstrate that power laws similar to those in language models also apply to world modeling and behavior cloning. This extension is crucial as it guides researchers in optimizing resource allocation for embodied AI tasks.
- The paper employs a rigorous methodology, using clear definitions and complementary fitting methods to establish scaling laws.
- Key findings include the influence of tokenizer compression rate on optimal model size in world modeling and the preference for smaller models with more data in behavior cloning with tokenized observations.
Limitations:
- While the paper justifies using pre-training loss as a proxy for performance, this approach has limitations. Studies have shown mixed results regarding the correlation between pre-training loss and downstream performance.
- The paper does not fully address domain-specific challenges such as physical interaction complexity and the lack of long tail scenarios.
- The curves generated using the frontier fit definitely support a stronger claim than those generated with the parametric fit.
- The paper could benefit from more detailed analysis of dataset diversity and potential biases.
- The dataset used for the experiments is proprietary, and the computational requirements for reproducing these results are substantial, limiting accessibility for researchers with fewer resources.
Other Comments or Suggestions
N/A
Ethics Review Concerns
N/A
Thank you for your review; we are pleased your judgement of the paper comes out on the side of acceptance and agree that your suggestions would further improve the paper. Due to logistical constraints, we are not able to complete all of these requests, but commit to completing the smaller-scale BC-Token experiments (which we agree is an excellent idea) before any camera-ready deadline. Below we respond to your three primary questions, and to several further key comments we noted.
- Q1. Additional evidence … to show the correlation between the training loss and game performance
We have done our best to frame all claims in the paper as optimizing for loss in the pre-training phase of agents, rather than for downstream performance, as these are the concrete results we have. We have further provided conceptual arguments and experimental evidence on the link between pre-training loss and performance, but agree this should continue to be explored through future work. We feel that doing this effectively for our dataset, environment, and setup would require resources beyond the scope of this paper. We would need to implement a form of post-training phase to maximize some reward signal, and to engineer an automated, distributed solution to enable rollouts in this game at large scale (something Bleeding Edge does not natively do). As such, similar to the LLM reference noted by the reviewer (their [2]), we believe this is best left to a separate paper.
- Q2. Smaller models in the BC-Token experiments
Thank you for this very sensible suggestion! Our original protocol had aimed for consistent model sizes across tasks and tokenizers, but on reflection, we agree that smaller models could be used for the BC-Token experiments. Logistically this will be difficult to complete within the rebuttal period (very small models do not fully utilize GPU FLOPs and hence are very time inefficient), but we are happy to commit to completing this before any camera-ready deadline (should we be fortunate enough to have the paper accepted).
- Q3. Additional extrapolation experiments
We agree with the spirit of this request – the gold standard test for scaling laws is how well they predict orders of magnitude out. As the reviewer mentions, this would require more compute than used in our current paper, which is hard for us to commit to. We hope our current results, which span around three orders of magnitude, are sufficiently interesting for much of the embodied AI community, which has not yet advanced to the model sizes used in LLMs.
- Comment 1. Claims feel stronger than results … since the scaling laws aren’t really general and are influenced by the specific task, tokenizer and architecture
Please let us know if there are specific wording changes you’d recommend. We aimed to write the paper in a way that would engage researchers across embodied AI, whilst avoiding overclaim or hype. Power laws did consistently emerge across the two domains tested, and one of our main contributions is that task, tokenizer and architecture do impact coefficients. This is an important departure from LLM research, where datasets, tokenizers and architectures usually only see minor variations.
- Comment 2. Dataset is proprietary, reproducing these results are substantial, limiting accessibility for researchers with fewer resources
The Bleeding Edge dataset was accessed under a data sharing agreement and unfortunately, we do not have agency to share ourselves. On the other hand, the RT-1 experiments are fully reproducible. This follows recent high-impact embodied AI work (Genie: Generative Interactive Environments) which described a lightweight environment for other researchers to experiment in. Note our LLM experiments are also designed to help researchers capture the essence of challenges we identified in a more accessible modality.
Thanks for the replies to the comments and questions. Overall, I don't think I will change my rating and it will remain a weak accept, as I feel the paper will be a useful addition to the community, but it's not directly clear how well the research generalises and transfers.
Q1 - I agree that this would be a useful, though not necessary, addition to the paper to strengthen the claims, and it can be left for a follow-up as the authors say.
Q2 - Thanks, curious to learn more about the results.
Q3 - I understand the difficulty of running further scaling experiments, and since this is only an additional point which would help, we can skip this for now.
Comment 1 - I appreciate the transparency in the results, and acknowledge that the authors do discuss the impact of task, tokenizer, and architecture; I do not contest this. I just wanted to raise my concern that the scaling laws don't transfer very well when one or more of these factors are changed, and so can't directly be reused by different users, unlike LLM results.
Comment 2 - Thanks for the clarification; unfortunately, complete reproducibility of these research findings isn't possible, which would definitely have made the claims of the paper feel stronger.
This paper introduces the first scaling laws for embodied agents, concretely behavior cloning and world models on games such as Atari. This work seems like a rigorous starting point; however, impact may be limited by the small scale of the largest FLOPs budgets (<1e20), which remain orders of magnitude below the budgets used in large-scale settings. For instance, Llama 3 used 1e22 as the maximum budget for its scaling analysis.