PaperHub
5.5/10
Poster · 3 reviewers
Ratings: 3, 3, 3 (minimum 3, maximum 3, standard deviation 0.0)
TL;DR

Our paper studies the scaling properties in driving, across multiple orders of magnitude in data, model size, and compute.

Abstract

Keywords
Autonomous Driving · Foundation Models · Behavior Modeling

Reviews and Discussion

Review
Rating: 3

This paper explores behavior modeling for autonomous driving and investigates the scaling properties from data to model parameters. The proposed method, DriveGPT, validates the benefits of scaling up both training data and compute, demonstrating improved model scalability as data increases—consistent with findings in language model scaling. To assess effectiveness, quantitative and qualitative comparisons are conducted across models from the scaling experiments. Furthermore, real-world deployment is showcased through closed-loop driving in challenging conditions, demonstrating the model's generalizability on the Waymo Open Motion Dataset, where it outperforms previous state-of-the-art methods in motion prediction.

Questions for the Authors

  1. How is the validation set divided?

Claims and Evidence

This paper claims similar scaling properties in behavior modeling, supported by experiments. However, although scaling effects are observed in the validation loss, the effect of scaling model size is not significant: Table 3 suggests little improvement in performance beyond 94M parameters.

Methods and Evaluation Criteria

This paper explores scaling properties using the same paradigm as LLMs. From the validation loss and the internal and WOMD evaluation metrics, the approach appears generally reasonable.

Theoretical Claims

No theoretical claims.

Experimental Design and Analysis

The overall analysis of the experiments is fairly thorough, with testing conducted on both internal data and WOMD. However, the results suggest that models around 100M parameters perform best. How do larger models perform? The model's scaling gains appear modest, especially compared to LLMs, where models with hundreds of millions of parameters are still considered relatively small.

Supplementary Material

The supplementary materials provide videos, which show good performance in behaviors such as unprotected left turns and lane changes. The appendix section includes more ablation experiments.

Relation to Existing Literature

This paper aims to advance research in behavior modeling for autonomous vehicles. Previous work has mainly explored small datasets and models, while this paper explores the effects of scaling in behavior modeling, which is highly valuable for the future development of autonomous driving.

Important Missing References

There is another line of methods, such as [1] and [2], that approach behavior modeling from an image modeling perspective and also use transformers for autoregressive action prediction. These methods can be briefly discussed in the related works section.

[1] DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers

[2] DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT

Other Strengths and Weaknesses

  1. Is it possible to release some of the internal driving data in the future to facilitate further exploration?

Other Comments or Suggestions

  1. Is the quality of the data also important, such as trajectory mining in complex scenarios, rather than just scaling the data volume?
Author Response

We sincerely thank the reviewer for your review and positive comments about our work. We are encouraged by your recognition of the value of our work for the future development of autonomous driving.

We address your comments and questions below.


Larger models beyond 94M

We agree with the reviewer’s observation that geometric metrics stabilize beyond 94M in Table 3, yet we have found that semantic metrics, such as collision rate, continue to improve as the model scales to 163M, as shown in the last column of Table 3 and further discussed in Section 5.1.2 (Page 7, Line 355). These results suggest that increasing model parameters could yield additional benefits in actual driving performance. In the final version, we will provide more qualitative comparisons between the 94M and 163M models in terms of semantic driving performance.


Scaling gains compared to LLMs

We appreciate the reviewer’s observation regarding the scale of our models compared to LLMs, which are typically trained on trillions of tokens with billions of parameters. A key challenge in scaling driving models is the collection of large-scale driving datasets. Unlike text data, which is abundant and easily accessible, high-quality driving data is expensive to acquire, requiring extensive real-world deployment across diverse scenarios. As discussed in Section 1 (Page 1, Line 043), this makes scaling driving models inherently more challenging than scaling LLMs.

Despite being relatively smaller in scale compared to LLMs, our work represents the largest effort in scaling driving behavior models to date. As reviewer mYMk noted, “Even though it seems the conclusion must be that magnitudes of more data and FLOPs leads soon to diminishing returns, that is an insight that is potentially very interesting for the community.” We hope our findings provide meaningful insights to the community and contribute to the advancement of the next generation of foundation models for autonomous driving.


Image-based behavior modeling literature

Thanks for highlighting these methods. We will include discussions on DrivingGPT and DrivingWorld in the related works section of the final version. As discussed in the paper, our approach focuses on a simple and scalable autoregressive model architecture, leveraging commonly adopted vector representations in the field. We believe our insights could be extended to additional input representations and encoders in future work.


Data release

We are highly interested in releasing a subset of our internal driving data and providing additional metadata information to facilitate further exploration in the field, which is currently pending internal review.


Data quality

We carefully curated our dataset to encompass a diverse range of urban driving scenarios with balanced distributions, including lane changes, intersections, double-parked vehicles, construction zones, and close interactions with pedestrians and cyclists. We believe that further improving data quality and sample diversity could enhance scaling results and defer a more comprehensive study as future work. We will add more details in the final version.


Validation set

The validation set was curated to include 10M samples sharing the same distribution as the training set but with no overlap. We used the same validation set for all scaling experiments for consistency. We will add clarifications in the final version.


Thank you once again for your time and thoughtful feedback. We hope that we have addressed your questions and that you will consider supporting the acceptance of our manuscript.

Review
Rating: 3

The paper presents a large transformer model that predicts future ego-agent states in a bird's-eye view for autonomous driving. The focus lies on an investigation of the scaling properties of transformers for behavior modeling, achieved by significantly increasing the model and dataset size. The method beats some baselines on the Waymo Open Motion Dataset test set and can be turned into a planning method to drive in real life.

Update after rebuttal

Given the authors' rebuttal and the comments of the other reviewers, I see no controversy in leaning toward acceptance. I therefore keep my original rating.

Questions for the Authors

The authors should respond to the comments below noting where something in the paper is lacking or missing.

Claims and Evidence

  • Present DriveGPT (fulfilled with the paper)
  • Determine empirical driving scaling laws for auto-regressive behavior models.

Detailed investigations with fixed or variable FLOP budgets, parameter and dataset sizes and comparison with baselines are given, e.g. in Figure 5 and 6.

  • Validate in real-world scenarios and closed-loop driving

This is shown in one video and described in one small paragraph. While interesting, this seems under-reported.

  • Outperform SOTA on WOMD

The model seems to be among the best, with very good minADE and minFDE values. Miss rate and Soft mAP are better for other approaches which the authors report themselves.

Methods and Evaluation Criteria

The WOMD test set is used widely in the domain and is suitable for comparing marginal prediction as done with this model.

Theoretical Claims

Claims are empirical and not theoretical.

Experimental Design and Analysis

The method has a very simple, straightforward design that is not hard to understand. The experimental design is common in the domain, where approaches are compared on BEV views in terms of standardized metrics. There are no general issues apart from the missing clarity of the closed-loop setting (more below).

Supplementary Material

The video suggests good performance in real driving.

Relation to Existing Literature

The paper beats the state of the art but does not really try to improve on a methodological level through a specific architectural trick or additional input modalities. The work positions itself well within conservative standards regarding the method, but then investigates scalability using resources that are not easily available to others. The contribution to the literature is therefore probably biggest in terms of the study on scalability.

Important Missing References

Some baselines are succeeded by newer approaches, e.g. MTR now competes in the Waymo challenge as MTR++. "MTR++: Multi-Agent Motion Prediction with Symmetric Scene Modeling and Guided Intention Querying", Shi et al. (2024 arXiv). However, while the approaches are compared on the Waymo leaderboard the relevant paper is not yet peer reviewed and published. So in this fast moving field there are some references not discussed but for this study it is probably justified to compare to the field at a decently recent point in time as the authors did.

Other Strengths and Weaknesses

Strengths:

The approach seems to assign a significant probability to alternative paths which is a good thing. Comparable approaches can suffer from mode collapse, predict only one solution and then struggle when the driving situation changes quickly or in an unexpected way. This is well illustrated in Figure 11.

The architecture is actually very simple. For an investigation into scaling laws this is positive to make it easier to attribute any performance gains.

The range of investigations, including those in the appendix, are interesting for the domain because they are hard to reproduce. Given the large training set and amount of GPUs used the contribution here lies also in providing a study that a group with less resources could not do.

Weaknesses:

Results on the WOMD test set do show better minADE and minFDE values but do not outperform the MotionLM Ensemble on Miss Rate or Soft mAP. Given the high number of parameters and training examples, this suggests the approach leads to diminishing returns. This is further suggested by Table 8 in the appendix.

The performance cannot be reproduced or built upon given the very limited information about the "Large-scale driving dataset". At a minimum, the needed information would include the countries in which the recording took place, the percentage of rural vs. city driving, the percentage of daytime vs. nighttime driving, and ideally some information about the data recording. It was shown in "A Review and Comparative Study on Probabilistic Object Detection in Autonomous Driving", Feng et al. (ITSC 2020), that performance across datasets can vary a lot even if they were recorded in the same country. Without camera input that effect will be smaller. However, to maximize the usefulness of this study's insights, more information should be given so readers can judge the amount of outliers, the proportion of cyclists and pedestrians, their customs of crossing the street, and other regional features that can impact some methods more than others.

In Figure 1, it is not clear what the smaller baseline is. Information should be given about the proportion of training data used and architecture. Is it the same transformer simply trained on less data? If yes, on how much less?

There should be more information about the closed-loop driving. Presumably the system drove on real streets with a safety driver, but is DriveGPT really used to drive the car? Detection of traffic lights, complex local rules that are potentially only in place during certain hours, and ad-hoc instructions by traffic-directing police officers would surely be an issue. It is also not clear which of the models was used on the street. Training happened on 16 H100 GPUs; how was driving realized? If inference needs the same amount of compute, was that realized in the car or remotely? If in the car, was some distillation involved? The information about this contribution lacks a lot of detail.

In summary, the paper seems to show moderately better performance, beating the state of the art incrementally. While there is no large methodological novelty, the study is a valuable investigation. In parallel to scaling large language models, scaling driving performance is hard to estimate without large computational resources and datasets, which the authors both provided. Even though it seems the conclusion must be that magnitudes of more data and FLOPs leads soon to diminishing returns, that is an insight that is potentially very interesting for the community.

Other Comments or Suggestions

The Wayformer Ensemble metrics do not seem to match the Nayakanti et al. 2023 paper (which does not seem to be the Ensemble paper), the WOMD challenge leaderboard, or the newer paper that should be the Wayformer Ensemble paper, "Scaling Motion Forecasting Models with Ensemble Distillation", Ettinger et al. (ICRA 2024). The results seem closest to the Nayakanti paper, but this refers to older results from 2021. The authors should check the metrics (which are approximately but not 100% correct) and either correct them or explain more specifically where they come from. There may be some confusion about which paper is the Ensemble paper.

Author Response

We sincerely thank the reviewer for your detailed and thoughtful feedback. We appreciate the positive assessment of our work as a valuable contribution to the autonomous driving literature through our large-scale scaling experiments, with simple straightforward experimental design and good performance in WOMD and real driving.

We address your comments and questions below.


Newer WOMD approaches

We thank the reviewer for pointing this out. We will include more recent methods in the related works section and additional results including MTR++ in Table 4 in the final version.

While MTR++ introduces a set of novel techniques to improve upon MTR, our method outperforms MTR++ in terms of minADE, minFDE, and MR, as indicated by Table 4 of our paper and Table 1 in the MTR++ paper.


Diminishing scaling returns

We agree with the reviewer’s observations on diminishing returns, which have also been noted in the LLM literature, and we appreciate the positive feedback on this insight. These findings are an important part of the story we want to share with the community to encourage further research in this direction. In the final version, we will clarify this discussion and defer exploring additional scaling trends using more data samples as future work.


Dataset details

Our dataset is collected from multiple countries and cities across North America and Asia, primarily through urban driving, with data evenly distributed across day and night. To maintain anonymity for the double-blind review, we will provide more details on the countries and cities in the final version.

As noted on Page 4, Line 171, we process camera, LiDAR, and HD map data into vectorized representations. While the training data includes only vehicle driving, each scene is captured in dense urban environments containing numerous cyclists and pedestrians exhibiting diverse behaviors, such as jaywalking, blowthrough, and riding in the opposite lane, as shown in our qualitative examples.

We will include all requested dataset details in the final version and are happy to provide any additional information if the reviewer identifies any gaps.


Figure 1 baseline

The baseline is an 8M model using the same transformer architecture, trained on 2.2M data. We will clarify this in the final version.


Closed-loop driving information

We use DriveGPT to drive a car in real time, taking input features from a perception system that provides agent states and map information. While the full system incorporates additional components to handle long-tail events, we showcase challenging scenarios in the supplementary video where DriveGPT alone is responsible for driving, demonstrating its effectiveness.

We used the 8M model, trained on the full dataset, to drive the car, achieving a latency of under 50ms. While training the model requires 16 H100 GPUs with a batch size of 2048 (see Page 11, Line 559), real-time inference only requires a batch size of 1 and can run on a single onboard GPU.

We will include all requested details in the final version and are happy to provide any additional information if the reviewer identifies any gaps.


Wayformer metrics

The Wayformer results reported in Table 4 of our paper are sourced from Table 1 in “Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding” (Zhang et al., NeurIPS 2023). In that paper, the authors attributed the Wayformer results to [Nayakanti et al., 2023] and explicitly labeled them as ensemble in the table caption.

In [Nayakanti et al., 2023], the authors stated in Section 5.4: “We further apply ensembling, a standard practice for producing SOTA results for leaderboard submissions.” The numbers reported in Table 1 (second-to-last row, LQ + Multi-Axis) of [Nayakanti et al., 2023] match those in Table 1 of [Zhang et al., 2023], confirming consistency.

To prevent confusion with [Ettinger et al., 2024], we will rename "Wayformer Ensemble" to "Wayformer" and clarify the source of the baseline metrics in the final version. We will also include additional discussion on [Ettinger et al., 2024] in the related works section.


Thank you once again for your constructive comments. We hope that we have addressed your questions and that you will consider supporting the acceptance of our manuscript.

Review
Rating: 3

This paper presents DriveGPT, a scalable behavior model for autonomous driving. The model has 1.4B parameters and is trained on 120M data sequences. DriveGPT is ∼3x larger and is trained on ∼50x more data sequences than existing published behavior models.

Questions for the Authors

Please refer to the concerns listed above, especially regarding the comparisons with other methods.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

NA.

Experimental Design and Analysis

Yes. Good experimental results on the scaling laws of motion prediction models.

Supplementary Material

Yes. The supplementary material contains qualitative results of planning trajectories of the ego-vehicle.

Relation to Existing Literature

Scaled up the model and dataset.

Important Missing References

Yes. I browsed the WOMD leaderboard and noticed that some of the top-performing methods haven't been discussed, like [A] and [B].

[A] MGTR: Multi-Granular Transformer for Motion Prediction with LiDAR

[B] ControlMTR: Control-Guided Motion Transformer with Scene-Compliant Intention Points for Feasible Motion Prediction

Other Strengths and Weaknesses

Strengths:

  • The paper is well-written and easy to follow.
  • This paper explores the scaling law of the motion prediction large models.

Weaknesses:

  • The main concerns are in the experiments. (1) The motion prediction results on WOMD (Table 4) only include works before 2023. The shown methods are not leading-edge enough at present. (2) The results are much lower considering the Soft mAP metric. The authors claim this is "due to suboptimal probability estimates". However, I'm not familiar enough with this task to fully understand their explanation.

Other Comments or Suggestions

NA.

Author Response

We sincerely thank the reviewer for your time and feedback. We appreciate the positive assessment of our well-written paper, good experiment results, and contribution to exploring the scaling law of the motion prediction large models.

We address your comments and questions below.


WOMD top-performing methods

We appreciate the reviewer for highlighting these methods that achieve top performance on WOMD.

MGTR leverages an augmented WOMD-LiDAR dataset that incorporates additional LiDAR inputs, which are not available in the standard WOMD dataset used by our method and most existing baselines in the literature (e.g., MTR, Wayformer, MotionLM). While we adhere to the standard WOMD dataset and use the open-sourced MTR encoder implementations for better reproducibility, our approach does not make any specific assumptions about encoder design. We believe our insights could be extended to additional input modalities and encoders in future work.

ControlMTR introduces a set of novel techniques to improve upon MTR. As shown in Table 4 of our paper and Table II in ControlMTR [B], our method outperforms ControlMTR in terms of minADE, minFDE, and MR.

We will ensure that discussions on these two methods, as well as additional relevant papers from the WOMD leaderboard, are included in the final version.


Table 4 baselines

We appreciate the reviewer’s comments on our WOMD baselines. While most of our baselines were published in 2023, Wayformer remains a high-ranked method on the WOMD 2024 leaderboard among all published and preprint papers.

Among all 30 methods on the WOMD 2024 leaderboard (https://waymo.com/open/challenges/2024/motion-prediction/), our minADE of 0.5240 and minFDE of 1.0538 are the second best, trailing only IMPACT_e2e, which was submitted in March 2025 (after our paper submission) and has no publication or preprint record as of today.

In the final version, we will include results from recently published papers and preprints, including [A] and [B] as referenced by the reviewer.


Soft mAP clarification

We discussed our soft mAP results in Section 5.2.3 (Page 8, Line 416). This limitation arises from using an autoregressive decoder to estimate sample weights, which are computed by accumulating probability estimates at each prediction step, as shown in Eq. (1).

More specifically, in WOMD, the model predicts 80 steps into the future to generate 8-second trajectory samples. Summing log probabilities over these steps can introduce compounding noise, leading to suboptimal probability estimates for each sample and lower soft mAP scores, which depend on accurate probability estimates across predicted samples.
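To make the compounding effect concrete, here is a small self-contained sketch with entirely hypothetical numbers (it is not the paper's model or its Eq. (1), only the accumulation mechanism): small per-step noise in log-probability estimates, summed over an 80-step rollout, can visibly shift the softmax weights across trajectory samples.

```python
import numpy as np

rng = np.random.default_rng(0)

STEPS = 80    # 8-second horizon predicted step by step, as in the WOMD setup
SAMPLES = 6   # number of predicted trajectory samples (illustrative)

# Hypothetical "true" per-step log-probabilities for each sample.
true_step_logp = rng.uniform(-0.3, -0.1, size=(SAMPLES, STEPS))

# Each step's estimate carries a small amount of noise; summing over
# 80 steps accumulates that noise into the per-sample total.
noise = rng.normal(0.0, 0.05, size=(SAMPLES, STEPS))
noisy_step_logp = true_step_logp + noise

def sample_weights(step_logp):
    """Softmax over each sample's accumulated log-probability."""
    total = step_logp.sum(axis=1)
    z = np.exp(total - total.max())  # subtract max for numerical stability
    return z / z.sum()

w_true = sample_weights(true_step_logp)
w_noisy = sample_weights(noisy_step_logp)

print("true weights :", np.round(w_true, 3))
print("noisy weights:", np.round(w_noisy, 3))
```

Although the per-step noise is tiny, its accumulated standard deviation over 80 steps becomes comparable to the true gaps between samples' totals, so the resulting sample weights (and their ranking) can shift noticeably, which is exactly what hurts ranking-sensitive metrics such as soft mAP.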

While we follow the standard approach from the LLM literature for computing autoregressive sample weights, a potential avenue for future work is to train an additional probability prediction head for each sample, which could enhance probability estimates and lead to improved soft mAP scores. We will further clarify this limitation with additional details in the final version.


Clarification of contribution

As we intend to demonstrate the generalizability of our method through WOMD experiments, the primary contribution of this work lies in providing a unique perspective to the community through our empirical scaling results, as acknowledged by the reviewer (“This paper explores the scaling law of the motion prediction large models”) and other reviewers (Reviewer mYMk: “The architecture is actually very simple. For an investigation into scaling laws this is positive to make it easier to attribute any performance gains”; “scaling driving performance is hard to estimate without large computational resources and datasets, which the authors both provided... that is an insight that is potentially very interesting for the community”. Reviewer i1gv: “this paper explores the effects of scaling in behavior modeling, which is highly valuable for the future development of autonomous driving”).

We hope our findings offer insightful contributions to the community and inspire further research in this direction.


Thank you again for your valuable feedback. We hope our responses sufficiently address your concerns and clarify the contributions and impact of our work. We would greatly appreciate your support in allowing us to share what we believe are meaningful insights with a broader audience and contribute to the advancement of next-generation foundation models for autonomous driving.

Final Decision

This paper received three Weak Accept recommendations, with all reviewers expressing support for its publication. After carefully considering their assessments, the AC agrees with the positive consensus and recommends acceptance.

While the paper does not introduce major methodological innovations, all reviewers agree that the paper presents a valuable empirical investigation. The proposed approach shows moderate but consistent improvements over the state of the art. Importantly, the authors back their claims with substantial computational resources and large-scale datasets—critical components for evaluating performance in the context of autonomous driving. Although the findings suggest that increasing data and compute leads to diminishing returns, this is a meaningful insight for the community, especially in an era where scaling laws are often assumed to yield proportional gains. Overall, the study offers a relevant contribution and is likely to be of interest to the ICML audience.