SMART: Scalable Multi-agent Real-time Motion Generation via Next-token Prediction
We present a novel motion generation paradigm for autonomous driving and validate the scalability and zero-shot generalizability of the proposed model.
Abstract
Reviews and Discussion
The main contributions of the paper include: 1) proposing a new motion generation framework that combines road and agent trajectory tokenization schemes with a decoder-only Transformer trained on the next-token prediction task; 2) demonstrating zero-shot generalization ability and scaling laws across different datasets; 3) achieving state-of-the-art performance on most metrics with SMART, with single-frame inference time kept within 15 milliseconds, meeting the real-time requirements of interactive simulation for autonomous driving.
Strengths
SMART introduces a novel method for generating autonomous driving motions, which converts map and trajectory data into sequences of tokens and uses a Transformer architecture for prediction, achieving effective modeling of real driving behavior.
Across multiple metrics, SMART ranks among the top entries on the Waymo Open Motion Dataset leaderboard, and its inference speed in particular demonstrates efficient performance alongside zero-shot generalization ability.
Weaknesses
- The scaling-law ability of Transformer-based models has been demonstrated in many papers, which limits the novelty of the method proposed in this paper.
- The results in "Table 4: Ablation study on each component of SMART" indicate that "RVT", "NAT", and "NRVT" are harmful for models trained on WOMD.
Questions
- Can the authors provide results on NuPlan and compare with the latest works?
Limitations
The paper explores motion prediction with a GPT-style network, which has been done in many works such as StateTransformer. I recommend that the authors include more discussion of the differences and improvements compared to previous works.
Q1: "The paper explores motion prediction in GPT-style network, which has been done in many works such as StateTransformer." + "The scaling law ability of models based on transformer structure has been proven in many papers, resulting in limited novelty of the method proposed in this paper"
A1: Thank you for your valuable feedback. We will reference related work in the introduction of the revised manuscript and provide additional clarification.
1. While StateTransformer is indeed an autoregressive model, similar models like MVTE also exist. These autoregressive models utilize diffusion or distribution regression frameworks but do not implement discrete tokenization for map and agent input features or employ a cross-entropy based next token prediction paradigm.
2. Furthermore, prior methods like StateTransformer have limitations in validating scaling laws because both training and testing splits come from the same dataset. This increases the likelihood of overfitting to a specific dataset, which inflates performance. In contrast, validation of scaling laws and generalization abilities in the LLM [1] and vision [2] domains typically requires completely independent testing datasets. Our paper states in the introduction: "Generalizability means achieving satisfactory results across diverse datasets through zero-shot and few-shot learning, while scalability involves improving model performance as dataset size or model parameters increase." Therefore, our experiments on scalability and generalization are the first to be validated across multiple datasets. Additionally, as shown in the results in our note to all reviewers (Q2), we believe that the combination of cross-entropy autoregressive prediction and discrete tokenization for both map and agent features is crucial for achieving scalability and generalization across datasets, which represents our key contribution.
Q2: "The results in "Table 4: Ablation study on each component of SMART" indicate that "RVT", "NAT", and "NRVT" are harmful for models trained on WOMD."
A2: We would like to clarify that the ablation study in Table 4 shows that while discrete tokenization may lead to some information loss, this loss can negatively impact performance on a single dataset. However, it significantly enhances generalization across different datasets, which aligns with our expectations. Moreover, techniques like "NAT" and "NRVT" can effectively improve model performance after discretization of inputs. We also emphasize in the conclusion the importance of lossless discretization for map inputs in future research.
Q3: "can the author provide results on NuPlan and compare with the latest works?"
A3: Thank you for the valuable suggestion. Below are the results of our experiments, which are fully aligned with the Val14 benchmark in PLUTO [3]. Due to character count limitations, we regret that we cannot cite every baseline method. The table shows the performance of our multi-agent SMART model when directly transferred to the NuPlan challenge, with a particular focus on purely data-driven methods. Notably, SMART achieves the highest Score (90.17), Progress (99.52), and Drivable (99.33) among the pure-learning methods, outperforming other models on several metrics.
| Planner (Pure Learning) | Score | Collisions | TTC | Drivable | Comfort | Progress | Speed |
|---|---|---|---|---|---|---|---|
| PDM-Open | 50.24 | 74.54 | 69.08 | 87.89 | 99.54 | 69.86 | 97.72 |
| GC-PGP | 61.09 | 85.87 | 80.18 | 89.72 | 90.00 | 60.32 | 99.34 |
| RasterModel | 66.92 | 86.97 | 81.46 | 85.04 | 81.46 | 80.60 | 98.03 |
| UrbanDriver | 67.72 | 85.60 | 80.28 | 90.83 | 100.00 | 80.83 | 91.58 |
| PlanTF | 85.30 | 94.13 | 90.73 | 96.79 | 93.67 | 89.83 | 97.78 |
| PLUTO (w/o post.) | 89.04 | 96.18 | 93.28 | 98.53 | 96.41 | 89.56 | 98.13 |
| SMART (w/o post.) | 90.17 | 95.29 | 90.30 | 99.33 | 99.90 | 99.52 | 97.28 |
To ensure these results are reproducible, we will update the code for the NuPlan challenge accordingly. However, as noted in the conclusion of the original paper, "As a motion generation model, the ability of SMART to migrate to planning and prediction tasks still needs to be verified, and this is our top priority for future work." The primary reason for not including this experiment in the paper is that, unlike sim-agent tasks, the characteristics of planning tasks have not been fully considered, such as the need for more ego-vehicle information as input and a greater emphasis on driving safety.
[1] Achiam, Josh, et al. "GPT-4 technical report." arXiv preprint arXiv:2303.08774 (2023).
[2] Bai, Yutong, et al. "Sequential modeling enables scalable learning for large vision models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[3] Cheng, Jie, Yingbing Chen, and Qifeng Chen. "PLUTO: Pushing the Limit of Imitation Learning-based Planning for Autonomous Driving." arXiv preprint arXiv:2404.14327 (2024).
Thanks for the authors' response, which addresses my concerns. I would like to raise my score to weak accept (6).
In this paper, a GPT-style motion generator is developed for scalable multi-agent simulation. Through various techniques such as motion tokenization, factorized agent attention, and next-map-segment prediction, the proposed SMART framework ranked 1st on the WOSAC leaderboard for the meta metric. Further zero-shot and scalability analyses manifest the generalizability of the proposed SMART framework.
Strengths
- A unified token design for motion and map, with a straightforward decoder-only GPT structure for agent simulation.
- Solid performance on the WOSAC leaderboard meta metric over other sim-agent methods.
- Comprehensive analysis of scalability and zero-shot transferability.
Weaknesses
- Limited methodological differentiation from Trajeglish [1] and MotionLM [2], other auto-regressive decoders for motion prediction or sim agents.
- Lack of RoadNet evaluation on the road-vector NTP task.
- Missing experimental details, such as a clearer description of the auto-regressive sampling process, simulation performance at different scales, etc.
- A thorough check of notations and writing is needed. For instance, I found a suspected ChatGPT sentence, "Here's the revised version of your text with improved precision and grammar:", in line 508. Also, the notation for modality should be changed to differentiate it from "Value".
Reference:
[1] Philion, J., Peng, X. B., & Fidler, S. (2024). Trajeglish: Traffic Modeling as Next-Token Prediction. In The Twelfth International Conference on Learning Representations.
[2] Seff, A., Cera, B., Chen, D., Ng, M., Zhou, A., Nayakanti, N., ... & Sapp, B. (2023). Motionlm: Multi-agent motion forecasting as language modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8579-8590).
Questions
- What's the difference in decoder design compared with Trajeglish? Please clarify.
- What is the definition and learning process for road-vector NTP? What's the performance of SMART without this augmentation?
- How does the inference process for SMART work? Is any sampling trick used in SMART?
Limitations
N/A.
We sincerely thank the reviewer for the thoughtful review and constructive suggestions. We answer the questions as follows.
Q1: "What's the difference in decoder design compared with Trajeglish? Please clarify." + "Limited methodological differentiation from Trajeglish [1] and MotionLM [2], other auto-regressive decoders for motion prediction or sim agents."
A1: We appreciate your feedback, and we will include a discussion of these differences in the revised manuscript.
- In addition to the discrete tokenization of roads and the decoder-only transformer structure emphasized in the paper, there are also structural differences in the decoder. As described in Section 3.2 of our paper: "In our work, we leverage a factorized Transformer architecture with multi-head cross-attention (MHCA) to decode complex road-agent and agent-agent relationships along the time series." This approach separates spatiotemporal attention into two dimensions, unlike Trajeglish and MotionLM, which compress the spatiotemporal dimensions into a single sequence. We will add these differences to the Related Work in the revised version.
- This design difference significantly affects inference efficiency. In our model, all vehicles make simultaneous decisions for the next time step, while MotionLM and Trajeglish sequentially output each agent's prediction within the same time step. For example, generating the future scenes of 32 agents over 8 seconds at a 2 Hz frequency requires only 16 steps of autoregressive inference with our model. In contrast, the latter require 32 × 16 = 512 autoregressive inference steps. This efficiency is a key reason why our model can complete multi-agent interactive simulations within 20 ms per step. We will include similar explanations in the discussion of the inference section in the revised version.
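The step-count arithmetic above can be illustrated with a small back-of-the-envelope sketch (the helper names are hypothetical; only the quoted numbers come from the response):

```python
# Compare the number of autoregressive inference passes needed to roll out
# a scene: joint per-timestep decoding (all agents at once, SMART-style)
# versus per-agent sequential decoding (MotionLM / Trajeglish-style).

def joint_decoding_steps(horizon_s: float, hz: float) -> int:
    """All agents emit their next motion token in a single forward pass."""
    return int(horizon_s * hz)

def sequential_decoding_steps(horizon_s: float, hz: float, num_agents: int) -> int:
    """Agents are decoded one at a time within each timestep."""
    return int(horizon_s * hz) * num_agents

# Numbers from the response: 32 agents, an 8 s horizon, 2 Hz motion tokens.
assert joint_decoding_steps(8, 2) == 16
assert sequential_decoding_steps(8, 2, 32) == 512
```

The speedup factor is simply the number of agents in the scene, which is why it matters most in dense traffic.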
Q2: ” Minor in the experimental details, such as clearer auto-regressive sampling process, sim performance for different scale, etc.”
A2: Thank you for your feedback. In the initial version of the manuscript, a description of the inference process is provided in Appendix A.1, line 495, and we conducted performance evaluations of models at different scales in Table 9, line 545. We acknowledge that the previous version's brief description of the inference process may have led to your confusion.
1. We will add the following clarification in the revised version of Appendix A.1: Specifically, our model operates at 2 Hz, and we interpolate the results to 10 Hz for evaluation. We use 1 second of history along with the current frame as conditions and predict the next 8 seconds. We can further factorize the traffic simulation task as:

$$P\left(S_{1:T} \mid S_0\right) = \prod_{t=1}^{T} P\left(S_t \mid S_{0:t-1}\right),$$

where $S_t$ is the set of all states for all agents at timestep $t$. As indicated by the formula, our approach samples the future motion of all vehicles in the current scenario simultaneously for the next time step. In contrast, previous methods like Trajeglish and MotionLM sequentially sample each vehicle's future motion for the next time step. The main reasons for our approach are twofold: first, it significantly increases the model's inference efficiency by a factor of $N$, where $N$ is the number of agents in the scene. Second, in real traffic interactions, it is challenging to define a reasonable ordering of vehicle interactions within the same time step, as vehicles typically plan their intentions concurrently.
2. Regarding the sampling process, we do not employ complex strategies. As stated in line 502: "To balance realism and diversity, we use top-5 sampling at every step during the simulation." Given the focus of this article on the generalization and scalability of the model, we have achieved strong results in specific scene generation without extensively exploring detailed sampling tricks.
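Top-k sampling as described can be sketched generically as follows (an illustrative implementation, not the authors' code; `logits` stands in for a hypothetical model output over the motion-token vocabulary):

```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int = 5, rng=None) -> int:
    """Sample a token id from the k highest-scoring logits.

    Probability mass outside the top k is discarded and the remaining
    softmax probabilities are renormalized before sampling.
    """
    rng = rng or np.random.default_rng()
    top = np.argsort(logits)[-k:]             # indices of the k largest logits
    p = np.exp(logits[top] - logits[top].max())
    p /= p.sum()                              # renormalize over the top k
    return int(rng.choice(top, p=p))

# Toy vocabulary of 7 motion tokens; the sample always lands in the top 5.
logits = np.array([0.1, 2.0, -1.0, 3.5, 0.7, 1.2, -0.3])
token = top_k_sample(logits, k=5)
assert token in set(int(i) for i in np.argsort(logits)[-5:])
```

Restricting to the top 5 tokens trades a little diversity for realism: low-probability (often kinematically implausible) tokens are never drawn.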
Q3: "Lack of RoadNet evaluation on the road-vector NTP task” +“Whats the performance of SMART without this augmentation? "
A3: We have included relevant explanations in our note to all reviewers Q1 and will make appropriate modifications in the revised manuscript.
Q4: "What is the definition and learning process for road-vector NTP?"
A4: Unlike sequential agent motions, road vectors form a graph. To address this, we extract the original topological information of the roads and model the road vector tokens as sequences based on their predecessor-successor connections. This approach requires RoadNet to understand the connectivity and continuity among unordered road vectors. The loss function for a single tokenized road polyline is defined as:

$$\mathcal{L}_{\text{road}} = -\sum_{i} \log P_{\theta}\left(v_{i+1} \mid e_{1:i}, \mathcal{P}\right),$$

where $P_{\theta}$ denotes the categorical distribution predicted by RoadNet parameterized by $\theta$, $\mathcal{P}$ represents a complete polyline that has not yet been split into road vector tokens, $e_{1:i}$ represents the road token embeddings of the predecessors, and $v_{i+1}$ is the next predicted road vector token. This loss function ensures that RoadNet learns to predict the correct next road vector token given the preceding tokens, thereby capturing the spatial continuity and connectivity within the road network. The loss function formula and explanations will be included in the revised version of the paper.
Q5: "A thorough check for notations and writing"
A5: We apologize for the oversight. We will correct the noted issues in the revised version, including the suspected ChatGPT sentence in line 508 and the notation for modality to clearly differentiate it from "Value."
Thanks for clearly addressing my concerns. I would like to maintain my current rating (6) and raise the "contribution" score.
This paper presents SMART, a model for multi-agent traffic simulation. The approach is based on a decoder-only transformer architecture that predicts all agents' motion tokens autoregressively over time. The architecture makes use of factorized attention layers (over map, agents, and time) and relative positional encodings between agent and map tokens. As in Trajeglish, SMART uses the K-disk algorithm to tokenize agent motion trajectories. However, SMART also tokenizes road vectors and performs next-token prediction on road tokens as a pre-training task. Experiments on the Waymo Open Sim Agents Challenge show that SMART achieves state-of-the-art performance. In addition, SMART's tokenization strategy allows it to generalize better between datasets (NuPlan and WOSAC).
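The K-disk-style motion tokenization mentioned above amounts, roughly, to matching each short trajectory segment against a fixed vocabulary of template segments. A deliberately simplified nearest-neighbor sketch (illustrative only, not the paper's implementation):

```python
import numpy as np

def tokenize_trajectory(segments: np.ndarray, vocab: np.ndarray) -> np.ndarray:
    """Map each trajectory segment to the id of the closest vocabulary token
    (Euclidean distance over flattened xy waypoints).

    segments: (num_segments, points_per_segment, 2) xy waypoints
    vocab:    (vocab_size, points_per_segment, 2) template segments
    """
    s = segments.reshape(len(segments), -1)
    v = vocab.reshape(len(vocab), -1)
    # Pairwise squared distances between each segment and each template.
    d2 = ((s[:, None, :] - v[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

vocab = np.array([[[0, 0], [1, 0]],      # token 0: straight step
                  [[0, 0], [1, 1]]],     # token 1: diagonal step
                 dtype=float)
segs = np.array([[[0, 0], [0.9, 0.1]]])  # closest to the straight template
assert tokenize_trajectory(segs, vocab).tolist() == [0]
```

In the full method the vocabulary is built from real data so that every real segment lies within a fixed disk radius of some token, which is what makes the discretization loss small.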
Strengths
- This paper presents a simple architecture for multi-agent traffic simulation that achieves state-of-the-art results in WOSAC. The architecture is quite similar to Trajeglish. However, SMART uses a decoder-only architecture, also tokenizes the road vectors, and uses factorized self-attention with relative positional encodings. In the paper, the authors demonstrate the efficacy of tokenizing road vectors for generalizing between datasets.
- The scaling law and zero-shot generalization experiments are interesting. To my knowledge, they are the first experiments of their kind in multi-agent traffic simulation. The zero-shot experiments also demonstrate the efficacy of tokenizing road vectors and using noised tokens for data augmentation during training, which validate the authors' design choices and provide interesting insights for others in the field.
- The ablation studies offer interesting insight into SMART's design choices; e.g., Table 7.
- Code is available as a part of the submission.
- The paper is generally well-written and easy-to-follow.
Weaknesses
- Without comparisons to other methods, it is difficult to interpret the significance of the scaling law and zero-shot generalization experiments. For example, while it appears that SMART's performance scales with its number of parameters, we cannot conclude whether it does so better than other architectures. Likewise, while SMART can zero-shot generalize from NuPlan to WOSAC, we cannot conclude whether it does so better than other architectures. This limits the significance of these experiments.
- The paper's results can be made stronger with more statistically significant experiments. For example, in Table 9, it is unclear whether the differences between SMART 8M, 36M, 96M are statistically significant or within the bounds of noise, given how close the numbers are.
Questions
- How much does road vector next token prediction contribute to SMART's performance?
- In Table 1, why is SMART's minADE higher than that of Trajeglish? This doesn't seem to match the results in Table 6 of the appendix.
- Table 4 shows the efficacy of certain architecture choices on generalizing from one dataset to another. How do these architecture choices affect the model's ability to learn from multiple datasets? Do these choices matter in the setting where you have access to NuPlan, WOSAC, and the proprietary dataset?
- L308: Why does dataset size limit your architecture size to 100 million parameters? Is this a matter of overfitting? If so, I think that it would be interesting to make note of this in the appendix of your paper.
Other:
- In L508, there may be an unintended sentence: "Here's the revised version of your text with improved precision and grammar:"
Limitations
The paper adequately addresses limitations.
We sincerely thank the reviewer for the thoughtful review and constructive suggestions. We answer the questions as follows.
Q1: How much does road vector next token prediction contribute to SMART's performance?
A1: We have included relevant explanations in our note to all reviewers and will make appropriate modifications in the revised manuscript.
Q2: In Table 1, why is SMART's minADE higher than that of Trajeglish? This doesn't seem to match the results in Table 6 of the appendix.
A2: We appreciate your attention to this detail. The minADE in Table 1 reflects the correct evaluation results from the WOSAC 2023 leaderboard. We mistakenly reported the minADE metric from WOSAC 2024 in Table 6. We appreciate your understanding and have corrected this in the revised manuscript.
Q3: "Table 4 shows the efficacy of certain architecture choices on generalizing from one dataset to another. How do these architecture choices affect the model's ability to learn from multiple datasets? Do these choices matter in the setting where you have access to NuPlan, WOSAC, and the proprietary dataset?" + "Without comparisons to other methods, it is difficult to interpret the significance of the scaling law and zero-shot generalization experiments."
A3: Thank you very much for your feedback. We agree that including relevant results can enhance the reliability of our main contributions. We have added these results in our note to all reviewers (Q2). We regret that, due to the high GPU cost of training large-scale models across datasets, we were unable to conduct ablation comparisons for each sub-module when validating scaling laws. We only compared SMART w/o, SMART, and our earlier reimplementation of the MVTE method. Nevertheless, the existing results demonstrate that discrete tokenization is an effective method for bridging dataset gaps. Additionally, autoregressive models utilizing cross-entropy classification loss are crucial for scalability.
Q4: “Why does dataset size limit your architecture size to 100 million parameters? Is this a matter of overfitting? If so, I think that it would be interesting to make note of this in the appendix of your paper.”
A4: This conclusion is based on our observations during training. Specifically, we found that for models with 10 million parameters, performance improvements plateau when the total training token count reaches around 0.1 billion. For 30 million parameter models, this plateau occurs after training on approximately 0.7 billion tokens, while the 100 million parameter model continues to show improvements with full data. Due to constraints in training resources and dataset size, we did not pursue training larger models. Regarding the potential for overfitting with larger models, our experience suggests that when a model with a higher parameter count is trained on a single dataset for multiple epochs, its generalization ability tends to decline. We will include this discussion in the appendix of the revised manuscript.
Q5: "There may be an unintended sentence 'Here's the revised version of your text with improved precision and grammar:'."
A5: We apologize for the oversight. This sentence was unintentionally included and will be removed in the revised version of the paper.
Q6: “The paper's results can be made stronger with more statistically significant experiments. For example, in Table 9, it is unclear whether the differences between SMART 8M, 36M, 96M are statistically significant or within the bounds of noise, given how close the numbers are“
A6: We appreciate your suggestion regarding the need for more statistically significant experiments. Given the extensive test dataset of 44,920 samples in the actual WOSAC evaluations, statistical noise is minimal, with metric fluctuations typically around ±0.02. Furthermore, the models with different scales are already performing very close to the maximum possible score on the Waymo sim agent metrics (where the ground truth maximum score is 0.80), meaning the differences in performance metrics have become less pronounced. Consequently, the performance improvements between SMART 8M, 36M, and 96M models may not appear as significant due to these near-maximal scores.
Thank you for addressing my questions. Since I have no further concerns, I would like to maintain my "accept" rating.
Thank you very much for acknowledging our additional experiments and providing positive feedback! Your constructive comments and suggestions are very helpful in improving our paper quality. Thanks!
We thank all reviewers for their reviews. We are incorporating feedback into our paper and will post direct responses to each reviewer's comments and questions. First, we would like to address the common concerns raised by multiple reviewers:
Q1: How much does road vector next token prediction contribute to SMART's performance? + “Lack of RoadNet evaluation on the road-vector NTP task”
A1: We appreciate the insightful question regarding this aspect. In our original Table 4, NRVT was indeed evaluated alongside road-vector next-token prediction. To isolate the impact of this component, we conducted additional ablation studies.
| Model | RVT | NAT | NRVT | RVNTP | kinematics (WOMD) | interactive (WOMD) | map (WOMD) | kinematics (NuPlan) | interactive (NuPlan) | map (NuPlan) |
|---|---|---|---|---|---|---|---|---|---|---|
| M1 | | | | | 0.459 | 0.827 | 0.857 | 0.376 | 0.593 | 0.603 |
| M2 | ✓ | | | | 0.434 | 0.807 | 0.840 | 0.389 | 0.696 | 0.724 |
| M3 | ✓ | ✓ | | | 0.448 | 0.809 | 0.848 | 0.413 | 0.750 | 0.743 |
| M4 | ✓ | ✓ | ✓ | | 0.437 | 0.801 | 0.837 | 0.411 | 0.747 | 0.741 |
| M5 | ✓ | ✓ | | ✓ | 0.453 | 0.813 | 0.853 | 0.413 | 0.780 | 0.785 |
| M6 | ✓ | ✓ | ✓ | ✓ | 0.453 | 0.803 | 0.851 | 0.416 | 0.785 | 0.797 |
Our comparison between models M4 and M3 indicates that introducing the NRVT module alone harms the model's overall performance. This finding contrasts with the positive effects observed with NAT on model performance. We hypothesize that this discrepancy arises from a lack of a dedicated training task in map representation learning that guides the model to enhance its understanding of map information. Additionally, the comparison among M4, M5, and M6 shows that the combination of these elements leads to an improvement in overall model performance.
Q2: "Without comparisons to other methods, it is difficult to interpret the significance of the scaling law and zero-shot generalization experiments." + "The scaling law ability of models based on transformer structure has been proven in many papers, resulting in limited novelty of the method proposed in this paper"
A2: Thank you for your valuable feedback. We recognize the importance of comparative methods to contextualize the significance of our scaling law and zero-shot generalization experiments. Initially, we replicated the MVTE method on the sim-agent task, and the relevant results are included in the attached PDF.
The term "SMART w/o" refers to the SMART model without the road vector tokenization and noise strategies proposed in this paper. To ensure fairness in our experiments, we adjusted all model parameters to the 90-100M range. An interesting observation is that, although our proprietary dataset contains more data than the NuPlan dataset, the performance of MVTE trained on our dataset was inferior to that on NuPlan. This suggests that models based on distribution regression may overfit to specific datasets. From the SMART w/o results, it is evident that the model's generalization performance is limited. However, including incremental data improves performance compared to using a single training dataset. Our findings indicate that discrete tokenization is an effective method for bridging dataset gaps. Additionally, autoregressive models utilizing cross-entropy classification loss are crucial for scalability, paralleling the significant scaling capabilities observed in large language models (LLMs). We plan to supplement this section with additional experiments in the appendix of the revised manuscript.
This submission introduces SMART, a novel autonomous driving motion generation approach that tokenizes map and trajectory data, processes it through a transformer architecture, and demonstrates state-of-the-art performance, zero-shot generalization capabilities, and scalability across multiple datasets. Three reviewers vote to accept the submission, although concerns do exist about the novelty of the approach, given that scaling laws have been demonstrated many times before. The experiments are comprehensive and the results are convincing. After careful consideration of the reviews, the AC decides to accept the submission.