SceneDiffuser: Efficient and Controllable Driving Simulation Initialization and Rollout
Abstract
Reviews and Discussion
This paper proposes a diffusion-based method for traffic simulation, called SceneDiffuser. SceneDiffuser unifies simulation by handling both scene initialization (generating initial scene layouts) and scene rollout (simulating closed-loop agent behavior). To accelerate the costly diffusion process, the paper introduces an amortized diffusion technique that requires 16x fewer inference steps. The paper further introduces LLMs for few-shot prompt engineering. For verification, SceneDiffuser achieves strong results on the Waymo Open Sim Agents Challenge, with top performance in both the open-loop and closed-loop settings.
Strengths
- Introducing the diffusion mechanism to the topic of traffic simulation is a good idea. Previous methods, like MTR++ or TRAJEGLISH, typically use a transformer architecture and formulate the problem as regression or autoregressive prediction.
- The extension to LLMs looks interesting to me.
Weaknesses
- I think the writing can be improved.
a. It is hard to understand Algorithms 1-3, which contain many details and notations without explanation.
b. Is the role of Figure 4 to illustrate the autoregressive process? The figure confuses me.
c. The amortized diffusion part is hard to understand.
- The motivation of the LLM extension is unclear to me. It seems like an application of SceneDiffuser, and has little to do with SceneDiffuser's improvements over existing traffic simulators.
a. It would be interesting to see how LLM + SceneDiffuser helps autonomous driving.
- Missing experiments:
a. How does the proposed method compare to SceneTransformer [1], especially in terms of performance? The two papers share some insights and should be compared fairly.
[1] Scene Transformer: A unified architecture for predicting multiple agent trajectories
Questions
Please see weaknesses.
Limitations
N/A
Thank you for your thoughtful and constructive feedback. We address the comments below:
I think the writing can be improved. a. It is hard to understand Algorithm 1-3 given so many details and notations without explanation. b. The role of Figure 4 is to illustrate the auto-regressive process, or not? It confuses me when reading the figure. c. The amortized diffusion part is hard to understand.
Thank you for your feedback. With the additional page of space in the camera-ready, we have added clarifying details regarding the algorithms: extra comments in the algorithm pseudocode and a corrected typo in the notation. We note that we do have positive feedback from other reviewers about the figures, such as “The figures are very illustrative and can help understanding” and “The paper is easy to understand in an overall sense” (Reviewer xCJf). If there is more detailed feedback on a specific part of Figure 4 or the algorithms that is difficult to understand, we would be happy to add more detail and clarification.
Regarding notation, the notation in Algorithms 1-3 is consistent with the notation defined in Sec 3.1 (Scene Diffusion Setup) and Sec 3.2 (Scene Rollout), lines 164-169. Each of Algorithms 1-3 is kept under 8 lines of pseudocode. Regarding Figure 4: it is correct that it illustrates the improved autoregressive process in diffusion models, which we refer to as Amortized Diffusion, as stated in the figure caption. We created the schematic in Figure 4 to make it easier to understand Algorithm 3, which details the amortized diffusion process.
The motivation of the LLM extension is unclear to me. For me, it seems like an application of the SceneDiffuser, and has nothing to do with the improvements of SceneDiffuser to existing traffic simulators. a. It would be interesting to see how LLM + SceneDiffuser helps autonomous driving.
LLMs can be leveraged to simplify the scenario creation process with SceneDiffuser. While SceneDiffuser allows controllable inference by specifying the location, type, size, etc. of certain agents at certain waypoint steps and generating the scene in an inpainting style, defining the controllable waypoints itself is non-trivial. Connecting the common-sense world knowledge of LLMs to SceneDiffuser via the proto interface in a few-shot approach, and experimentally validating its feasibility, is a non-trivial finding worth sharing in the paper, and it helps with integrating the model with LLM interfaces.
Missing experiments: a. How about the proposed method compared to SceneTransformer [1], especially in terms of performance. For me, I think the two papers share some insights and should be compared fairly. [1]. Scene Transformer: A unified architecture for predicting multiple agent trajectories
Thanks for the suggestion and reference. We have added a citation now to SceneTransformer in the introduction where we discuss behavior prediction methods. SceneDiffuser shares some similarities with SceneTransformer regarding using masking to direct the same model towards different tasks. However, SceneTransformer is only applied to the behavior prediction task (predicting motion at all future timesteps in a single model inference) and is not applied to either scene generation or closed-loop simulation, so we do not believe it is applicable for a direct task-level comparison.
I appreciate the authors' response and will maintain my score.
The paper proposes a diffusion model for data-driven simulation of driving scenes. The model is able both to generate driving scenes and to perform closed-loop simulation of these driving scenes. The problem is posed as an inpainting task in the scene tensor, which contains all the agents, time steps (past and future), and road user attributes/states. To avoid excessive compute when using the diffusion model in a receding-horizon fashion (autoregressive rollout), the paper proposes to reuse the solution of the last iteration (similar to nonlinear MPC) and to use a custom noising strategy that adds more noise to time steps further in the future. The resulting solution is competitive with other diffusion-based methods at a fraction of the compute, and reusing the previous trajectory as a warm start allows for more consistent trajectories over time.
Strengths
The computational burden of diffusion policies is in my opinion one of their main drawbacks. The proposed amortized diffusion reduces this burden while producing more consistent trajectories over time.
By formulating the problem as an inpainting problem the method is general and can tackle several interesting tasks.
Weaknesses
- By relying on the previous solution, there seems to be a risk of getting stuck in local minima. Did you notice such situations?
- Implementing hard constraints by projection seems risky, especially in the amortized rollout phase where only one denoising step is taken. I can imagine that for some types of constraints this works well but others will be problematic. Similar to optimization problems where bounds can be handled with clipping but more general constraints need more elaborate implementations (e.g. barrier functions in interior point methods). Did you notice that generalized hard constraints worked better or worse for some types of constraints?
Questions
- How is it enforced that the size and type of a vehicle do not change over time?
- Did you investigate taking several denoising steps in the rollout phase? In nonlinear MPC this can often improve the results if the computational load allows it, and it reduces the dependency on the previous trajectory.
- Why is the offroad rate not lower? Is it not possible to constrain the motion to onroad predictions?
- What is the runtime of the model in the rollout phase (time for one denoising step)?
- The WOMD setup has a horizon of 8s. In the rollout phase, do you always predict the full 8 seconds, or do you reduce the horizon length? If you keep the 8s horizon, how do you deal with the horizon potentially leaving the map? Similarly, if you reduced the prediction horizon to about 4s, how would this change the results?
Typos:
- L261: notedly → notably
- L18: "We demonstrate of effectiveness" → "We demonstrate the effectiveness"
Limitations
Limitations have been addressed.
Thank you for your thoughtful and constructive feedback. We address the comments below:
By relying on the previous solution, there seems to be a risk of getting stuck in local minima. Did you notice such situations?
Thank you for this question. Even though we utilize the previous timestep’s solution, the previous predictions are progressively noised, with higher noise the further out the time step is. This allows the trajectories to still vary to a great extent. While the immediate next step is only noised by 1/T (where T is the total length of the future horizon), it is reasonable to assume that the object cannot veer too far from its current position in a single step to begin with. Our competitive simulation realism metrics are another piece of indirect evidence that this is not a common occurrence.
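To make the progressive noising concrete, here is a minimal NumPy sketch of re-noising the previous rollout with a linear per-step noise ramp. The function name, the (T, D) layout, and the variance-preserving mixing rule are illustrative assumptions, not our exact parameterization:

```python
import numpy as np

def renoise_previous_rollout(prev_traj, rng):
    """Re-noise the previous rollout so later steps receive more noise.

    prev_traj: (T, D) array of the T future waypoints predicted at the
    previous simulation step (feature dimension D). The DDPM-style mixing
    below is illustrative, not the paper's exact parameterization.
    """
    T = prev_traj.shape[0]
    # Linear noise ramp: the next step gets noise level 1/T, the last step 1.
    sigma = np.arange(1, T + 1) / T                      # (T,)
    eps = rng.standard_normal(prev_traj.shape)           # fresh Gaussian noise
    # Variance-preserving mix: fully noised at sigma=1, lightly at sigma=1/T.
    noised = np.sqrt(1.0 - sigma[:, None] ** 2) * prev_traj \
        + sigma[:, None] * eps
    return noised

rng = np.random.default_rng(0)
prev = np.zeros((8, 2))                # a toy 8-step, 2-D trajectory
noised = renoise_previous_rollout(prev, rng)
# The final step is pure noise; the first step stays close to the prediction.
```

Because the final step is mixed at full noise level, the model retains the freedom to diverge from the warm start at the far end of the horizon.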
Implementing hard constraints by projection seems risky, especially in the amortized rollout phase where only one denoising step is taken. I can imagine that for some types of constraints this works well but others will be problematic. Similar to optimization problems where bounds can be handled with clipping but more general constraints need more elaborate implementations (e.g. barrier functions in interior point methods). Did you notice that generalized hard constraints worked better or worse for some types of constraints?
Great thoughts. In our experiments we applied hard constraints in the unconditional SceneGen experiment, which jointly denoises all past and future steps from the same noise level in a one-shot fashion; these clipping constraints are therefore applied across up to 32 steps of denoising. This would be interesting to try in amortized rollouts, and based on this comment it would make sense to apply clipping constraints only on steps at least k steps in the future, since future steps go through more iterations of denoising before being finalized. We found that the basis on which the hard constraints operate is important: a good constraint will modify a significant fraction of the scene tensor, or else the model effectively "rejects" the constraint on the next denoising step. For example, to correct for collisions we shift an agent's entire trajectory rather than shifting just the overlapping waypoints, so as to maintain a more realistic (and non-colliding) trajectory for the diffusion process. We will further clarify these details in the final paper.
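As a toy illustration of the trajectory-shift idea, here is a NumPy sketch; the circle footprints, the single constrained agent, and the helper name are all hypothetical simplifications of the projection constraint described above:

```python
import numpy as np

def shift_trajectory_to_resolve_overlap(traj_a, traj_b, radius):
    """Resolve a collision by translating agent B's *entire* trajectory.

    traj_a, traj_b: (T, 2) center positions; radius: scalar footprint radius.
    Shifting the whole trajectory (rather than just the overlapping
    waypoints) keeps its shape, and hence its plausibility, intact.
    """
    deltas = traj_b - traj_a                        # (T, 2)
    dists = np.linalg.norm(deltas, axis=-1)         # (T,)
    worst = np.argmin(dists)                        # step of closest approach
    overlap = 2 * radius - dists[worst]
    if overlap <= 0:
        return traj_b                               # no collision to fix
    # Push the whole trajectory along the direction of closest approach.
    direction = deltas[worst] / (dists[worst] + 1e-9)
    return traj_b + overlap * direction

a = np.zeros((5, 2))                    # agent A parked at the origin
b = np.tile([1.5, 0.0], (5, 1))        # agent B overlapping it (radius 1.0)
b_fixed = shift_trajectory_to_resolve_overlap(a, b, radius=1.0)
```

A per-waypoint correction would instead kink the trajectory at the overlapping steps, which the denoiser tends to smooth away on the next step.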
How is it enforced that the size and type of a vehicle do not change over time?
Object size is treated similarly to all other features such as position and yaw; therefore there is no hard constraint that enforces it to be constant. However, even in logged data, object sizes for each agent are in fact not strictly constant due to perception / detection noise. We believe that learning this small fluctuation of object features due to perception noise further improves the realism of the simulation.
Did you investigate taking several denoising steps in the rollout phase. In nonlinear MPC this can often improve the results if the computation load allows it and reduce the dependency on the previous trajectory.
We did not look into different denoising schedules across future timesteps; to reduce simulation cost (a single denoiser evaluation per simulation step), we only used a linear noise schedule from 0 to 1 from the current to the final time step. However, this is a good idea, and very recent work (Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion) has looked into arbitrary per-step noise schedules with encouraging results.
Why is the offroad rate not lower, is it not possible to constrain the motion to onroad predictions?
In the WOMD dataset there are many vehicles that are located "offroad", such as in parking lots or driveways. Consequently the offroad metric measures a model's ability to produce such agents at the appropriate rate. SceneDiffuser learns to produce both onroad and offroad agents, but much headroom remains. We explored using an onroad constraint (see Appendix A.7) to force individual agents to be onroad or offroad, but we found this did not significantly improve the offroad metric. The problem is not that an onroad agent goes offroad; the problem is that SceneDiffuser produces offroad agents at the wrong rate relative to logs, which we have addressed more successfully using generic methods such as architecture scaling (see Table 1).
Thanks a lot for addressing my concerns. I would like to keep my rating.
The paper proposes a method called SceneDiffuser to generate multi-agent scenarios for autonomous driving and to roll out those scenarios. The two tasks are unified in a single model by formulating them as an inpainting task on the scene tensor. The diffusion model uses an amortized diffusion technique to align the diffusion steps with the physical prediction timestamps, enhancing efficiency. Additionally, language-based agents can be inserted into the scene tensor, and rule-based constraints are applied in the diffusion procedure. The scene rollout task is evaluated on the Waymo Sim Agents task and achieves results comparable to SOTA autoregressive methods, and the scene generation task is validated on the Waymo Motion dataset, achieving strong realism.
Strengths
- Unifying scene generation and simulation rollout in a single model is great and relatively new. As far as I know, some models can use post-processing to achieve these two closely related tasks, but there were no unified models previously.
- The figures are very illustrative and can help understanding. The paper is easy to understand in an overall sense.
- SceneDiffuser adopts the amortized diffusion strategy from the human animation literature. It eases the efficiency problem in simulation rollout.
- The overall performance is great. Scalability is validated as well. Hard constraints are injected during diffusion, to solve the problem of realism.
Weaknesses
I do not see critical problems, but there are a few things worth discussion and improvement.
- Metrics. The paper lacks descriptions or simple introductions of the metrics used in the experiments. For example, the statement in Line 234 ("different metrics buckets are aggregated per-scene instead of per-agent") could be further elaborated for better readability. The scene generation task uses metrics different from those of previous scene generation methods, e.g., [8, a]. Detailed introductions and discussions of these issues would be ideal.
- The layout and ordering of the tables and figures can be further improved. Figures and tables are not ordered by their reference order, nor placed close to where they are referenced. I acknowledge that it could be difficult to place them perfectly given their number and varying sizes. However, the current layout and ordering pose difficulties in reading the paper, making the reading process not smooth.
- The paper focuses on scene generation and simulation rollout, similar to previous related literature. If the generated scenes could be utilized to help motion prediction or even other downstream tasks, the impact of the paper would be broader.
- There are some related works about scene generation that are not discussed, e.g., [a-c]. I am also wondering whether those using WOMD could be compared quantitatively. This question also relates to the metrics.
- The proposed method can still have realism problems, even though constraints are injected in the diffusion procedure. This inherent drawback of diffusion methods cannot be fully relieved.
- Typos:
- Line 18: We demonstrate
- Line 112: Appendix A.2 should be Appendix A.4.
[a] Language Conditioned Traffic Generation. CoRL 2023.
[b] DriveSceneGen: Generating Diverse and Realistic Driving Scenarios from Scratch. RA-L, 2024 (arXiv 2309).
[c] CaDRE: Controllable and Diverse Generation of Safety-Critical Driving Scenarios using Real-World Trajectories. arXiv 2403.
Questions
- Would amortized diffusion strategy have problems (such as different diffusion steps) on the boundary of the prediction window?
- Is the predicted size for one specific agent not consistent during the prediction horizon?
Limitations
Yes.
Thank you for your thoughtful and constructive feedback. We address the comments below:
Figures and tables are not ordered by the reference order and are placed close to where they are referenced.
Thank you for the suggestion, we will improve the layout in the camera-ready version.
Missing / additional citations.
We do cite Language Conditioned Traffic Generation, but the year was malformatted in the BibTeX entry, which we have now fixed. Thank you for suggesting the two other works; we have added references to them in our Related Work.
Would amortized diffusion strategy have problems (such as different diffusion steps) on the boundary of the prediction window?
Great question. Because the diffusion noise schedule is a linear ramp from 0 to 1 going from the current step to the boundary of the prediction window, the step at the prediction window is associated with maximum (full) noise. Therefore simply appending a random Gaussian vector is sufficient for that step. As the rollout proceeds, we iteratively append a random Gaussian vector at the end of each step to extend the simulation.
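The sliding-window mechanics can be sketched as follows; `denoise_one_step` is a stand-in for a single denoiser evaluation, and the interface is an assumption for illustration, not our actual API:

```python
import numpy as np

def amortized_rollout_step(window, denoise_one_step, rng):
    """One simulation step of an amortized rollout (illustrative sketch).

    window: (T, D) noisy future window whose noise level ramps linearly
    from ~0 at index 0 to 1 (pure noise) at index T-1.
    """
    refined = denoise_one_step(window)
    executed = refined[0]                 # the now fully-denoised current step
    # Slide the window: drop the executed step and append pure Gaussian
    # noise at the boundary, which by construction sits at maximum noise.
    new_tail = rng.standard_normal((1, window.shape[1]))
    window = np.concatenate([refined[1:], new_tail], axis=0)
    return executed, window

rng = np.random.default_rng(0)
window = rng.standard_normal((8, 2))      # toy 8-step window, 2 features
identity = lambda w: w                    # placeholder for the denoiser
for _ in range(3):                        # simulate three steps
    executed, window = amortized_rollout_step(window, identity, rng)
```

Each appended Gaussian tail then passes through T single-step refinements before it is executed, which is how the amortization spreads the denoising work across simulation steps.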
Is the predicted size for one specific agent not consistent during the prediction horizon?
Object size is treated similarly to all other features such as position and yaw; therefore there is no hard constraint that enforces it to be constant. However, even in logged data, object sizes for each agent are in fact not strictly constant due to perception / detection noise. We believe that learning this small fluctuation of object features due to perception noise further improves the realism of the simulation.
Metrics. It lacks descriptions or some simple introductions about the metrics for those experiments. For example, the statement in Line 234 ("different metrics buckets are aggregated per-scene instead of per-agent") could be further elaborated for better readability.
We apologize that the metrics descriptions are not sufficiently detailed due to the submission page limit. We have now added a comprehensive technical description of the metrics to the appendix and will add this to the camera-ready version given the additional page allowance.
The metrics used in the unconditional scenegen task are minor variants of the metrics in the Waymo Open Sim Agents Challenge (WOSAC). The core idea of the WOSAC metrics is to measure the negative log likelihood (NLL) of the ground truth logged scene under the distribution from the generated samples. The NLL is computed over 9 measurements: kinematic metrics (linear speed, linear acceleration, angular speed, angular acceleration magnitude), object interaction metrics (distance to nearest object, collisions, time-to-collision), and map-based metrics (distance to road edge, and road departures). A weighted average over the NLLs across the 9 measurements is then computed as the final composite score. However, the NLLs in WOSAC are computed at a per-agent granularity. That means each logged agent’s log likelihood is measured under the distribution of 32 samples in the predictions for the same agent. This can be done because there exists a one-to-one correspondence between each simulated agent and each logged agent (since they share the same history). However, in unconditional scenegen there does not exist a one-to-one mapping between each logged agent and each generated agent. Therefore when measuring the NLL, we flatten (num_agents, num_steps * num_samples) into (num_agents * num_steps * num_samples,) and compute the histograms per-scene by scrambling all agents and all timesteps into the same histogram.
WOSAC Metrics: Suppose there are $N$ scenarios, each of length $T$ steps, each containing $A$ agents (objects). For each scenario, we generate $S$ samples (conditioned on the true initial state), where each sample is a set of trajectories for each object for each time step, and each point in a trajectory is a 3-dim vector recording location and orientation $(x, y, \theta)$. Let all this generated data be denoted by $\hat{D} = \{\hat{\tau}_{i,s,a,t}\}$. Let the ground truth data be denoted $D = \{\tau_{i,a,t}\}$. Below we discuss how to evaluate the likelihood of the true (test) dataset $D$ under the distribution induced by the simulated dataset $\hat{D}$.

(Note that we may have $A_i^{GT} \neq A_i$, since the ground truth (GT) can contain cars that enter the scene after the initial prefix used by the simulator; this is handled by defining a validity mask $v_{i,a,t}$, which is set to 0 if we want to exclude a GT car from the evaluation, and is set to 1 otherwise.)

Rather than evaluating the realism of the full trajectories in the raw state space, WOSAC defines $M = 9$ statistics (scalar quantities of interest) from each trajectory. Let $\hat{F}^j_{i,a} = \{f^j(\hat{\tau}_{i,s,a,t})\}$ represent the set of statistics/features (of type $j$) derived from $\hat{D}_i$ by pooling over $(s, t)$. This is used to compute a histogram $\hat{q}^j_{i,a}$ for the empirical distribution of $f^j$ for scenario $i$. Let $f^j_{i,a,t} = f^j(\tau_{i,a,t})$ be the value of this statistic from the true trajectory for vehicle $a$ at time $t$. Then we define the negative log likelihood to be

$$\mathrm{NLL}^j_{i,a,t} = -\log \hat{q}^j_{i,a}\big(f^j_{i,a,t}\big).$$

The $j$'th metric for scenario $i$ is defined as

$$m^j_i = \frac{\sum_{a,t} v_{i,a,t}\, \mathrm{NLL}^j_{i,a,t}}{\sum_{a,t} v_{i,a,t}}.$$

Finally an aggregated metric to rank entries is computed as

$$m = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} w_j\, m^j_i,$$

where $\sum_{j=1}^{M} w_j = 1$.

SceneGen Metrics: We instead let $\hat{F}^j_i = \{f^j(\hat{\tau}_{i,s,a,t})\}$ represent the set of statistics/features (of type $j$) derived from $\hat{D}_i$ by pooling over $(s, a, t)$. This is used to compute a single histogram $\hat{q}^j_i$ for the empirical distribution of $f^j$ for scenario $i$.
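As a rough NumPy sketch of the per-scene (SceneGen-style) histogram NLL for a single statistic: the bin count, value range, and smoothing constant below are illustrative choices, not the challenge's exact configuration:

```python
import numpy as np

def per_scene_nll(sim_values, log_values, bins=20, value_range=(0.0, 30.0)):
    """NLL of logged feature values under the simulated histogram.

    sim_values: 1-D array of one statistic (e.g. linear speed) pooled over
    all samples, agents, and timesteps of a scene (the "scrambled" pooling).
    log_values: the same statistic computed from the logged scene.
    """
    hist, edges = np.histogram(sim_values, bins=bins, range=value_range)
    # Smooth the bin probabilities so unseen bins get nonzero mass.
    probs = (hist + 1e-6) / (hist.sum() + 1e-6 * bins)
    # Map each logged value to its bin and average the negative log prob.
    idx = np.clip(np.digitize(log_values, edges) - 1, 0, bins - 1)
    return -np.mean(np.log(probs[idx]))

rng = np.random.default_rng(0)
sim = rng.uniform(0, 30, size=32 * 128 * 80)   # samples x agents x steps
log = rng.uniform(0, 30, size=128 * 80)        # one logged scene, flattened
nll = per_scene_nll(sim, log)
```

The WOSAC variant would instead call such a function once per agent, pooling `sim_values` only over samples and timesteps.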
Thanks for the clarifications.
- Please incorporate the details about metrics in the revision. High-level descriptions would be good in the main paper, and you can put more details in the appendix given the limited space.
- The potentially inconsistent object-size prediction is a little weird. However, this can probably be mitigated by certain post-processing.
- Concerning the additional related works, I was curious whether those providing WOMD results could be compared quantitatively.
I read other reviewers' comments and authors' replies, and I also agree with their concerns such as SOTA comparisons and LLM's integration. Generally, the contributions of this work are technically solid.
One last thing, it seems there will be no open-source code for this work. This is a pity. Official code can greatly promote the development of the community.
The paper introduces SceneDiffuser, a novel scene-level diffusion model designed to enhance traffic simulation for autonomous vehicle (AV) development. It presents a unified framework that addresses scene initialization, involving the generation of initial traffic layouts, and scene rollout, which includes the closed-loop simulation of agent behaviors. SceneDiffuser leverages diffusion models to learn realistic and multimodal agent distributions, focusing on controllability, realism maintenance in simulations, and inference efficiency. The model introduces amortized diffusion for simulation, reducing computational costs and mitigating closed-loop errors. Additionally, it enhances controllability through generalized hard constraints and language-based constrained scene generation using large language models (LLMs). The paper demonstrates SceneDiffuser's effectiveness in the Waymo Open Sim Agents Challenge, achieving top performance among diffusion models.
Strengths
- Innovative Approach: SceneDiffuser's use of amortized diffusion for simulation rollout generation is a creative solution that significantly reduces the computational cost per step.
- Unified Framework: The model's ability to handle both scene initialization and rollout in a unified framework is a notable strength, simplifying the simulation process.
- Controllability: The introduction of generalized hard constraints and the use of LLMs for constraint-based scene generation offer a high degree of control over simulation scenarios.
- Performance: Achieving top performance in the Waymo Open Sim Agents Challenge indicates that SceneDiffuser is effective in real-world applications.
- Scalability: The model's performance improves with increased computational resources, showing that it can scale with available hardware.
Weaknesses
Despite mitigation efforts, the paper acknowledges that closed-loop errors remain a challenge, indicating room for further improvement.
While the model performs well among diffusion models, it does not exceed the current state-of-the-art performance for other autoregressive models, suggesting a need for comparison and potential integration.
Questions
How does SceneDiffuser handle scenarios with a high number of agents or complex traffic situations not seen in the training data?
How does the model ensure the diversity and representativeness of the generated traffic scenarios?
Limitations
- Model Validity: The paper does not explicitly model validity masks, relying instead on logged validity, which could be a limitation for scenarios not covered in the logs.
- SOTA Comparison: While the model performs well among diffusion models, it does not exceed the current state-of-the-art performance for other autoregressive models, suggesting a need for comparison and potential integration.
Thank you for your thoughtful and constructive feedback. We address the comments below:
While the model performs well among diffusion models, it does not exceed the current state-of-the-art performance for other autoregressive models, suggesting a need for comparison and potential integration.
Potential integration of diffusion models with autoregressive approaches is an interesting and promising direction. In fact, amortized diffusion also follows an autoregressive rollout schedule, making it possible to integrate with AR models. One interesting recent work (Autoregressive Image Generation without Vector Quantization from Kaiming He’s team) also hints at this direction in combining the autoregressive models with diffusion models that operate in the continuous vector space. We acknowledge that this is an interesting and promising frontier, where we hope our work serves as one of the first explorations in this space.
How does SceneDiffuser handle scenarios with a high number of agents or complex traffic situations not seen in the training data?
The Waymo Open Motion Dataset (WOMD) is specifically mined for scenes with a high number of agents, with up to 128 agents per scene, accompanied by complex scenarios. Our transformer backbone iterates through axial attention separately across agents and time, which reduces the complexity of scaling to more agents. See Figure 13 (appendix) for examples from the held-out validation set, containing some very dense and complex traffic scenarios generated by our model.
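As a toy sketch of the axial factorization (identity projections and a single head, purely for illustration; not our actual backbone):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Self-attention over the second-to-last axis (toy identity weights)."""
    q, k, v = x, x, x                      # identity projections for brevity
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ v

def axial_attention(scene):
    """Alternate attention over time and over agents.

    scene: (num_agents, num_steps, dim). Full joint attention over all
    (agent, step) pairs would cost O((A*T)^2); attending over one axis at
    a time costs O(A*T^2 + T*A^2).
    """
    # Temporal attention: each agent attends over its own timeline.
    scene = attend(scene)
    # Agent attention: at each timestep, agents attend to each other.
    scene = np.swapaxes(attend(np.swapaxes(scene, 0, 1)), 0, 1)
    return scene

x = np.random.default_rng(0).standard_normal((128, 8, 16))
y = axial_attention(x)
```

With 128 agents, the agent-axis attention matrix stays 128x128 per timestep rather than growing with the product of agents and steps.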
How does the model ensure the diversity and representativeness of the generated traffic scenarios?
Thanks for this question. We can try to answer from three angles:
- Diffusion models are by design known for their ability to learn complex, diverse, and multimodal distributions, as also observed in other works (e.g., Diffusion Policy: Visuomotor Policy Learning via Action Diffusion). Diffusion models sample from a bias-free Gaussian prior, which enables the generation of diverse outputs.
- We implicitly measure and capture the diversity and representativeness of the generated scenes in our reported metrics. Since the unconditional scenegen and sim agents metrics based on the WOSAC challenge measure distributional realism (from 32 samples), the precision and recall aspects of the distributional realism metrics reward "representativeness" and "diversity", respectively.
- For a qualitative assessment of diversity and representativeness, see Figure 13 (appendix) for generated examples from our model containing some very dense and complex traffic scenarios.
I appreciate the response and all generally makes sense. I'll maintain my score.
We thank the reviewers for their thorough and thoughtful comments. We are pleased to see that all reviewers are overall positive about the work, finding our proposed amortized diffusion rollout to be a “creative solution” (XARz) that alleviates “one of the major drawbacks (of diffusion policies)” (j5Jn). Reviewers also appreciate our unified framework that “simplifies the simulation process” (XARz) given that “there are no unified models previously” (xCJf). We are happy that reviewers find that our “figures are very illustrative” (xCJf) and the paper is “easy to understand” (xCJf).
We will address all technical questions in the per-reviewer rebuttal sections. For comments regarding typos, figure / table / content layout arrangements, and more detailed clarifications, we will address them in the final camera-ready paper if the work is accepted to the conference.
This work introduces SceneDiffuser, an efficient and controllable paradigm for generating simulations in autonomous driving. The paper points out two critical challenges (controllability and inference efficiency) in the field and addresses them with solid experiments and novel methods (generalized hard constraints, amortized diffusion).
All reviewers reach consensus that the manuscript is ready for publication. The rebuttal did a good job addressing reviewers' concerns. Good job! Please incorporate all comments and revise the draft for the camera-ready.