PaperHub

NeurIPS 2024 · Poster · 5 reviewers
Overall rating: 6.6 / 10 (individual ratings: 6, 9, 5, 8, 5; lowest 5, highest 9, standard deviation 1.6)
Average confidence: 4.2 · Soundness: 3.4 · Contribution: 3.2 · Presentation: 3.4

SimGen: Simulator-conditioned Driving Scene Generation

Submitted: 2024-05-07 · Updated: 2024-12-30
TL;DR

This paper introduces a simulator-conditioned diffusion model, which follows the layout guidance from the simulator and cues of the rich text prompts to generate images of diverse driving scenes.


Keywords
Autonomous Driving · Generative Models · Simulators

Reviews and Discussion

Official Review (Rating: 6)

This paper aims to provide a photorealistic appearance to a conventional graphics engine. The basic idea is to use ControlNet to convert the rendered semantic mask and depth into real images, similar to the settings of image-to-image translation and style transfer. The paper also introduces the DIVA dataset, which contains driving logs with diverse appearances and layouts.


Update after rebuttal: I've raised my score from Borderline reject to Weak accept. See comments for details.

Strengths

  1. This paper presents a novel application that endows a graphics engine with realistic appearances using a diffusion prior. It can create driving scenarios with varied appearances, such as different illumination and weather.
  2. To provide diverse and realistic generation, SimGen introduces the DIVA dataset, collected from high-quality YouTube driving videos, to aid training.
  3. The overall flow of the presentation, including the writing and figures, is smooth and easy to follow.

Weaknesses

  1. Modern perception systems in self-driving cars rely on surround-view cameras to perceive the world in 360 degrees. This paper seems to neglect this problem and only simulates front-view cameras.
  2. Conditioning on a simulator may not be necessary. We know that simulators can render sharp and accurate semantics and depth. However, this paper does not leverage these advantages and instead proposes CondDiff, which trains a diffusion model to convert SimCond to RealCond, since simulated conditions have a distribution distinct from real-world conditions. Meanwhile, simulator data is sometimes converted from real-world data using its lane layout and 3D boxes. So why not reconstruct the mesh or 3D occupancy from real-world data and then perform conditional generation? In this way there would be no domain gap between real and simulated conditions, and controllability would be maintained.
  3. This paper only uses certain interfaces of the simulator, which can be trivial to implement, as they could easily be replaced by low-level traffic simulation followed by converting the projected 3D box masks to RealCond. Leveraging simulators is therefore questionable, and the underlying technical contribution can be limited.
  4. The generation results of SimGen are restricted to the asset bank of the simulator itself. It can only generate the categories available in the simulator. As a result, SimGen only demonstrates improvement over certain categories (road, car, and truck) in downstream tasks, as shown in Table 4 and Table 5.

Questions

  1. How many categories can SimGen generate? It seems that not all categories in nuScenes will be generated, as discussed in line 128.
  2. What is the difference between the proposed DIVA dataset and the OpenDV-YouTube dataset from GenAD? The latter is also collected from YouTube.
  3. As the CFG scale (classifier-free guidance scale) affects the conditioning strength, I am wondering how the CFG scale impacts the pixel-diversity metric.

Limitations

The authors have discussed their work's limitations and potential negative societal impact.

Author Response

Before diving into the details, we restate the benefits of our simulator-conditioned pipeline.

Simulators can reconstruct scenarios from public datasets, real-world driving logs (e.g., Waymo, nuPlan), and program-generated scenes to obtain data with diverse layouts and annotations. Also, simulators allow for editing the motions and states of traffic participants with rules or interactive policies [1], enabling the generation of risky scenarios from driving records. Finally, simulators provide control over all objects and spatial positions, facilitating controllable generation.

W1: Neglect of surround-view cameras.

We agree that multi-view generation is meaningful for driving systems, but it is orthogonal to the problem we tackle here. Our goal is to achieve appearance and layout diversity in generating novel scenarios, while multi-view generation emphasizes consistency between adjacent views. They are two interrelated yet distinct problems. Rather than just overfitting on nuScenes [2], SimGen learns from both datasets and web data to ensure content diversity. The front-view setup provides SimGen with a unified IO for utilizing data with various sensor setups. We plan to achieve multi-view consistency via cross-view attention [2] and learning from 360-degree images [3] in future work.

W2: Necessity of conditioning on simulators.

Conditioning on the simulator has significant advantages.

  1. The MetaDrive simulator provides control over all objects and their 3D locations in a traffic scenario. It also provides access to 100K real-world traffic scenarios imported from Waymo and nuPlan. Besides, it can effortlessly generate scenes encompassing risky behaviors with its physics engines.

  2. While mesh and occupancy can be utilized for conditional generation [4], existing annotators struggle to reconstruct scenes from web data. Thus, the model's generalization capability would be limited to public datasets. In contrast, semantics and depth can be rendered from simulators, and pseudo-labels can be obtained from datasets and web data. Therefore, SimGen achieves superior generation diversity while preserving controllability by incorporating the simulator.

  3. It also paves the path for agent-based interaction to unify the generated perception with downstream decision-making and simulation applications.

W3: Implementation of incorporating simulators.

  1. Incorporating simulators is not trivial. Beyond implementing interfaces, we customize and extend the functions within MetaDrive, including importing attributes from real logs, modifying asset shapes, customizing camera parameters, designing instance cameras, etc. All code will be made publicly available.

  2. The usage of simulators cannot be replaced by traffic simulations like LimSim [5], as spatial conditions help enhance the quality of generated images. It's not feasible to export the first-view conditions (e.g., the semantic shapes) from a traffic simulator. While projected 3D boxes offer basic control over object positions and sizes, simulator assets closely resemble real objects in terms of 3D shapes and provide details like vehicle doors, windows, wheels, pedestrian limbs, etc. Moreover, annotating 3D boxes from web data is challenging, restricting models based on traffic simulation to generate scenarios within a limited amount of fully annotated data.
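
For concreteness, the sketch below is a hypothetical illustration (not code from the paper or its simulator) of what a projected-3D-box condition amounts to: the eight box corners are projected through a pinhole camera and filled as a flat silhouette, with none of the part-level detail a rendered asset provides.

```python
# Hypothetical illustration (not from the paper): a projected 3D bounding box
# yields only a coarse polygonal silhouette, whereas a graphics simulator can
# render per-pixel semantics of the full asset geometry (doors, wheels, limbs).
import numpy as np
import cv2  # opencv-python, used only to rasterize the convex hull

def box_corners(center, size, yaw):
    """Eight corners of a yawed 3D box in camera coordinates (x right, y down, z forward)."""
    l, w, h = size
    x = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * l / 2
    y = np.array([ 1, -1,  1, -1,  1, -1,  1, -1]) * h / 2
    z = np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) * w / 2
    rot = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                    [ 0,           1, 0          ],
                    [-np.sin(yaw), 0, np.cos(yaw)]])
    return (rot @ np.stack([x, y, z])).T + np.asarray(center)

def project_box_mask(center, size, yaw, K, hw=(512, 512)):
    """Pinhole-project the corners and fill their convex hull as a binary mask."""
    uvw = (K @ box_corners(center, size, yaw).T).T
    uv = (uvw[:, :2] / uvw[:, 2:3]).astype(np.int32)
    mask = np.zeros(hw, dtype=np.uint8)
    cv2.fillConvexPoly(mask, cv2.convexHull(uv), 1)
    return mask

K = np.array([[500., 0., 256.], [0., 500., 256.], [0., 0., 1.]])
mask = project_box_mask(center=(0.0, 1.0, 12.0), size=(4.5, 1.9, 1.6), yaw=0.3, K=K)
# `mask` is a flat silhouette with no body contours -- the level of detail the
# reply above argues is only available from simulator-rendered semantics.
```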

W4: The generation is restricted to simulator assets.

The simulator's assets include 12 foreground objects, such as vehicles, pedestrians, cones, traffic signs, etc., not limited to just roads and cars. For a specific category, SimGen can generate diverse appearances based on texts, not confined to the assets. Tab. 15 (9th row) shows SimGen generates different types of vehicles (vans, trucks, and buses), differing from the predefined asset. Besides, we report the per-class results of Tab. 4-5 in Supp Tab. 2-3. Tab. 2 shows SimGen's excellent controllability, with only a -4.8 decrease in the bus category. Tab. 3 validates SimGen's benefits across all nuScenes categories, including motorcycle (+2.9), cone (+1.6), and barrier (+1.7).

Questions.

1. Aligned with the semantic labels of Cityscapes, SimGen can generate 19 categories, covering all classes in nuScenes and 97% of the pixels in daily driving scenes. Fig. 14 shows the diverse categories that SimGen can generate, including poles (r1c2), buses (r3c1), motorcycles (r3c3), barriers (r4c4), etc. Besides, Supp Tab. 3 confirms that SimGen generates all categories in nuScenes.

2. A part of the DIVA dataset, DIVA-Real, shares similarities with OpenDV, as both use YouTube data. However, DIVA targets different tasks and has a different data-preparation pipeline. Task objective: DIVA is designed for controllable image generation. To support multimodal conditions, DIVA offers depth and segmentation labels. These labels enable SimGen to learn and control the layouts and behaviors of generated scenarios. OpenDV is collected mainly for predictive video generation, so there is no need to collect these labels. Automated pipeline: The collection of DIVA is fully automated, facilitating data scaling while ensuring quality. In contrast, OpenDV requires manual quality checks, which is less efficient.

3. In Supp Tab. 9, as the CFG scale increases from 5 to 14, pixel diversity rises from 21.3 to 28.2 before reaching saturation. The rise is due to changes in image contrast and sharpness; once the scale passes a threshold, the differences between images diminish.
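
For readers less familiar with classifier-free guidance, the snippet below shows the standard guidance combination (a generic formulation, not SimGen's code): the scale linearly amplifies the conditional signal, which is consistent with the effect on contrast and diversity reported above.

```python
# Standard classifier-free guidance step (generic formulation, not SimGen-specific).
import torch

def cfg_epsilon(model, x_t, t, cond, null_cond, scale):
    """Blend unconditional and conditional noise predictions; a larger `scale`
    pushes samples harder toward the condition (layout and text prompt)."""
    eps_uncond = model(x_t, t, null_cond)
    eps_cond = model(x_t, t, cond)
    return eps_uncond + scale * (eps_cond - eps_uncond)
```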

[1] CAT: Closed-loop Adversarial Training for Safe End-to-End Driving.

[2] DrivingDiffusion: Layout-Guided Multi-view Driving Scene Video Generation with Latent Diffusion Model.

[3] Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion.

[4] Wovogen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation.

[5] LimSim: A Long-term Interactive Multi-scenario Traffic Simulator.

Comment

Thank you for the detailed response.

One of the methodologies of this paper is to replace the condition of the projected 3D box generated by a traffic simulator with the (semantic) mesh renderings generated by a traffic+graphics simulator. However, the mesh renderings have some flaws, so this paper has to train CondDiff, which converts mesh renderings into more realistic masks.

So the critical question is: why can this process not be learned by converting a projected 3D box into a realistic mask? I find the authors' rebuttal does not convince me. It would be more convincing to provide an experiment to evaluate this (I know it is hard during the rebuttal). If we can already train a CondDiff that converts a projected 3D box into a realistic mask, conditioning on a graphics simulator is unnecessary.

The authors mention that the mesh provides details like vehicle doors, which I believe is a minor improvement and can also be supplied to a projected 3D box using a global prompt or instance-level customization [1]. Besides, as we rely on CondDiff to refine the mesh rendering, the fine-grained property may be hard to guarantee (correct me if I am wrong).

I appreciate the careful experiments and beautiful visualizations in this paper, but I still have concerns that have yet to be addressed.

[1] MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Comment

Thank you for your valuable comments. We acknowledge the capability of diffusion models to convert projected 3D boxes into realistic masks or even to directly generate driving scenes [1, 2]. Nevertheless, when compared to our proposed simulator-conditioned pipeline, they exhibit inherent limitations.

1. Instead of just overfitting on public datasets like nuScenes, we emphasize generalizing to novel scenarios with appearance and layout diversity. As stated in L172, training a box-to-mask conversion model with sound generalizability requires paired 3D-box and semantics data far exceeding the scale of small, fully annotated datasets. In contrast, the simulator-conditioned scheme enables learning from web data bridged by depth and semantic conditions. Tab. 4 shows that SimGen, trained on DIVA, exhibits superior diversity far beyond existing methods (+6.5, a 32% improvement).

2. Generating realistic masks from boxes poses a greater optimization challenge compared to learning from simulator assets, impacting the quality and controllability of the generated images. While 3D boxes offer only rough control over positions and dimensions, assets closely resemble 3D shapes, e.g., trucks (long and square), sedans (flattened and smooth), and cones (triangular and round). Additionally, the table below presents the controllability evaluation results of SimGen-nuSc conditioned on boxes and simulators in nuScenes. The results validate that simulator-conditioned generation surpasses learning from boxes, especially in categories like cars (+2.6), motorcycles (+1.9), and cones (+3.1).

| Condition | FID ↓ | Car ↑ | Truck ↑ | Bus ↑ | Trailer ↑ | Constr. ↑ | Ped. ↑ | Motor. ↑ | Bicycle ↑ | Cone ↑ | Barrier ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Box | 16.7 | 36.5 | 17.3 | 36.1 | 10.6 | 0.4 | 24.4 | 21.4 | 20.3 | 41.8 | 43.5 |
| Simulator | 15.6 | 39.1 | 18.1 | 38.9 | 12.8 | 1.4 | 26.6 | 23.3 | 21.9 | 43.0 | 46.6 |

3. Providing details is not a minor improvement brought by simulators. While prior research [3, 4] can supply instance-level customization (prompts) to box conditions, the pixel-level control derived from simulators is much more precise. Also, instance-level prompts necessitate extra text annotations that are currently unavailable in public datasets, while simulators inherently offer detailed control. It is not hard to guarantee the fine-grained property with CondDiff, as it already learns pixel-level condition generation from large-scale web data.

[1] DrivingDiffusion: Layout-Guided Multi-view Driving Scene Video Generation with Latent Diffusion Model.

[2] Panacea: Panoramic and Controllable Video Generation for Autonomous Driving.

[3] MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis.

[4] LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation.

Comment

Thanks for your detailed feedback.

The author's rebuttal has largely addressed my concerns.

I've raised my score from Borderline reject to Weak accept.

I have one final question. In the experiment comparing the Box and Simulator conditions, is the CondDiff model trained on (mesh mask → real mask) pairs, or on (box mask → real mask) pairs? If (mesh mask → real mask) pairs are used to refine the box mask, there inherently exists a domain gap.

Comment

Thank you very much for your in-depth comments. In this experiment, we replace the simulator's original assets with cube-shaped assets indicating categories, dimensions, and orientations. The simulator then renders these assets to generate depth and semantic masks. The CondDiff is trained on pairs of cube mesh masks and real masks, ensuring fairness and eliminating any domain gap.

Official Review (Rating: 9)

This paper proposes a novel framework named SimGen, which aligns the web-scaled unannotated driving footage with the simulator-generated images and annotations to obtain the appearance diversity and traffic layout diversity for generating novel driving scene images.

Strengths

The paper tackles a challenging but under-explored problem: how can we best use the real-world and simulator data for autonomous driving together? The real-world driving footage is abundant on the internet and has great appearance diversity but lacks task annotations. On the other hand, the simulated data can easily obtain the task annotations but are still visually distinct from the real-world data. In this paper, the authors propose a novel image-generation framework for driving scenes named SimGen, which can utilize both real-world data and the flexible simulator to generate diverse driving scene images both for appearance and traffic layout. This work is remarkably valuable for the autonomous driving community.

The proposed SimGen framework designs a two-stage cascade diffusion process to alleviate the domain gap between the simulation and the real-world, as well as tackle the misaligned modalities of conditions from different data sources. This framework is wisely designed and can easily be extended to other robotics tasks, even beyond the scope of autonomous driving.

In the experiments, the proposed SimGen has shown the best image generation quality and diversity compared to existing methods, which proves the usefulness of the proposed SimGen framework.

Moreover, the proposed DIVA dataset in this paper contains great diversity in appearance, weather, lighting, and location. This can significantly enhance the existing autonomous driving datasets, in which only a few cities are involved.

Weaknesses

I just have a few presentation suggestions for this paper.

I think moving the Related Work to an earlier part of the paper would give readers a better understanding of this paper's contributions relative to existing works. Besides, I think the generalizability of the trained SimGen shown in Appendix D.3, "Simulator-conditioned image generation," is quite impressive and might be better moved to the main paper. That the trained SimGen can generate diverse images based on conditions provided by the novel simulator CARLA in a zero-shot manner surely illustrates its generalizability.

Questions

  1. I am curious about the solution of the statistical shortcut that the authors have mentioned in Appendix C.1.2. The core concern is that YouTube data has no ExtraCond but nuScenes data has, such that “outputting nuScenes-style images when ExtraCond is present and YouTube-style images when it’s absent.” I am wondering why this can be solved by an adapter that merges various modalities into a unified control feature.
  2. If I understand correctly, in DIVA-sim, the safety-critical data originating from the Waymo Open Dataset are only used to showcase the proposed SimGen’s ability to generate realistic images for diverse traffic layouts qualitatively. These data are not used for training SimGen since no ground truth real images can be obtained. Also, these data are not used for data augmentation or evaluation for the pre-trained perception models. I am wondering whether the authors can illustrate the usage of the safety-critical data originating from Waymo more clearly.
  3. In terms of future work to extend SimGen to multi-view image generation, are there any ideas the authors are willing to discuss, since multi-view video footage is not that abundant on the internet?

Limitations

The authors have discussed the limitations of this paper in the main paper.

Author Response

W1: Content presentation suggestions for this paper.

Thank you for the suggestion. We follow your advice to reorganize the Related Work section and move the content in Appendix D.3 to the main paper.

Q1: The solution of the statistical shortcut.

Integrating input conditions through an adapter is a common practice in training image generation models with multi-modal conditions [1], which serves as our foundation. Randomly dropping conditions within the adapter helps the model balance various condition modalities, preventing overfitting to specific ones. The adapter merges information from different modalities to reduce uncertainty caused by randomly dropping. Furthermore, we mitigate conflicts between multi-modal conditions by masking out simulator background information, preventing the model from learning associations between simulator backgrounds and nuScenes-style images.
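
The sketch below is our own minimal rendition of the mechanism described above (module names, channel sizes, and the drop probability are assumptions, not the released code): each modality is encoded separately, randomly dropped during training, and summed into one control feature, so the mere presence of any single modality cannot act as a dataset identifier.

```python
# Minimal sketch of a multi-modal condition adapter with random condition dropping.
# A simplified illustration of the rebuttal's description; names, channel counts,
# and the drop probability are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class ConditionAdapter(nn.Module):
    def __init__(self, modalities=("depth", "segmentation", "rendered_rgb", "instance"),
                 in_channels=3, feat_channels=320, drop_prob=0.3):
        super().__init__()
        self.encoders = nn.ModuleDict({
            name: nn.Conv2d(in_channels, feat_channels, kernel_size=3, padding=1)
            for name in modalities
        })
        self.drop_prob = drop_prob

    def forward(self, conds):
        """`conds` maps modality name -> (B, C, H, W) tensor. Absent or dropped
        modalities contribute nothing, so the model cannot use the presence of
        ExtraCond as a shortcut for producing nuScenes-style images."""
        fused = None
        for name, encoder in self.encoders.items():
            if name not in conds:
                continue  # e.g. YouTube clips carry no ExtraCond
            if self.training and torch.rand(()).item() < self.drop_prob:
                continue  # randomly drop this modality during training
            feat = encoder(conds[name])
            fused = feat if fused is None else fused + feat
        return fused  # unified control feature (None if everything was dropped)
```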

Q2: The usage of the safety-critical data originating from Waymo.

The collection of safety-critical data is a part of our contributions. The DIVA-sim dataset includes simulated driving videos and annotations of hazardous behaviors initialized from Waymo, ensuring diverse scene generation and advanced simulation-to-reality research. Annotations such as object attributes, lane lines, trajectories, etc., can be derived from the simulator. In addition to showcasing how SimGen can generate safety-critical scenarios qualitatively based on DIVA-Sim, the labeled driving scenes can also be utilized in map segmentation and object detection tasks for data augmentation. Furthermore, agents learned from safety-critical data have the potential to achieve superior driving safety [2]. We leave these parts as our future work.

Q3: Future work to extend SimGen to multi-view image generation.

Thank you for the suggestion. Regarding extending SimGen to multi-view image generation, we could consider leveraging 360-degree images from Google Street View [3], which have recently been used to train diffusion models for generating realistic multi-view street scene images. Additionally, we might incorporate cross-view attention mechanisms to ensure consistency across different views, similar to the approaches demonstrated in DrivingDiffusion [4] and Panacea [5].

[1] UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild.

[2] CAT: Closed-loop Adversarial Training for Safe End-to-End Driving.

[3] Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion.

[4] DrivingDiffusion: Layout-Guided Multi-view Driving Scene Video Generation with Latent Diffusion Model.

[5] Panacea: Panoramic and Controllable Video Generation for Autonomous Driving.

Comment

Thanks for your rebuttal. The rebuttal addressed all of my concerns. I will keep my rating as Very Strong Accept due to the significant contribution and novelty of this paper.

Official Review (Rating: 5)

The paper introduces SimGen, a framework for generating diverse driving scenes to reduce annotation costs in autonomous driving. SimGen combines simulator and real-world data using a novel cascade diffusion pipeline to address sim-to-real gaps and multi-condition conflicts. It is enhanced by the DIVA dataset, which includes extensive real-world and simulated driving data. SimGen improves synthetic data augmentation for BEV detection and segmentation tasks, with code, data, and models to be released.

Strengths

  • The paper is well organized.
  • The proposed DIVA dataset is designed in an elegant manner and could be beneficial to the research community.
  • The generated images are of good quality and diversity.

Weaknesses

  • Despite achieving superior results in single image generation, the 3D consistency has not been evaluated.

Questions

  • Without videos demonstrating 3D consistency, the proposed method is less practical for autonomous driving applications.
  • I am confused about the conditions used in SimGen. Figure 3 indicates that DIVA synthetic images are not used in the generation process, whereas Figure 4 shows them being used as conditions. Could you clarify where the synthetic images are actually utilized?

Limitations

The limitations are discussed in the paper. However, the social impact is not discussed.

Author Response

W1: Despite achieving superior results in single image generation, the 3D consistency has not been evaluated.

We agree with the reviewer that achieving 3D consistency is an important future direction. However, it is not the main focus of this submission. We have made preliminary attempts at video generation and describe the temporal-consistency design in Sec. 4.3 and Appendix D.2. Fig. 5, 12, and 16 provide visualization examples, and Tab. 11 indicates that SimGen performs comparably to other driving video generation models. Nevertheless, ensuring and evaluating 3D geometric consistency in generated driving scenes remains challenging, with all existing works (e.g., Panacea, DrivingDiffusion, GenAD) struggling to achieve it. Even powerful video generation models such as Sora and SVD have similar difficulties.

Q1: Without videos demonstrating 3D consistency, the proposed method is less practical for autonomous driving applications.

Although SimGen cannot guarantee 3D consistency, the capability to generate diverse driving images and videos with annotations is crucial to many applications. Tab. 5 demonstrates how SimGen can generate diverse driving scenarios for data augmentation in map segmentation and 3D object detection models. Fig. 5 and 12 showcase its applications in generating hazardous driving scenarios and closed-loop evaluation. Other applications like lane detection and BEV detection can also benefit from synthetic image-annotation pairs.

Q2: I am confused about the conditions used in SimGen. Figure 3 indicates that DIVA synthetic images are not used in the generation process, whereas Figure 4 shows them being used as conditions. Could you clarify where the synthetic images are actually utilized?

As illustrated in Fig. 3 and Tab. 2, a scene record is fed into the simulator to generate DIVA synthetic images, which are further grouped into SimCond (simulated depth and segmentation) and ExtraCond (rendered RGB images, instance maps, and top-down views). SimCond is transformed into RealCond by the CondDiff module and then, along with ExtraCond and text, is fed into the ImgDiff module to generate real images. More descriptions on this part will be included in the revision.
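
To make the data flow above easier to follow, here is a hedged pseudocode summary based on this reply and Fig. 3; all function and key names are placeholders rather than the released API.

```python
# Pseudocode summary of the cascade described in the reply; names are placeholders.
def generate_driving_image(scene_record, text, simulator, cond_diff, img_diff):
    # 1. The simulator renders the scene record into synthetic conditions.
    sim_out = simulator.render(scene_record)
    sim_cond = {"depth": sim_out["depth"],
                "segmentation": sim_out["segmentation"]}
    extra_cond = {"rendered_rgb": sim_out["rgb"],
                  "instance_map": sim_out["instance"],
                  "top_down": sim_out["top_down"]}

    # 2. CondDiff bridges the sim-to-real gap at the condition level:
    #    simulated depth/segmentation -> realistic depth/segmentation (RealCond).
    real_cond = cond_diff(sim_cond, text)

    # 3. ImgDiff generates the final image from RealCond, ExtraCond, and text.
    return img_diff(real_cond, extra_cond, text)
```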

Limitations.

We have discussed the social impacts in Appendix A.

Comment

Thanks for your rebuttal. The rebuttal addressed all of my concerns. I will keep my rating as Borderline Accept.

Official Review (Rating: 8)

This paper presents a simulator-conditioned scene generation framework, SimGen, which generates diverse driving scenes conditioned on simulator controls and texts. To support the model training, the authors also introduce a driving video dataset, DIVA, comprising 147.5 hours of both real-world and simulator-based driving data. SimGen can be controlled by both text prompts and layouts from the simulator. The synthetic data generated by SimGen can be used for data augmentation on BEV detection and segmentation tasks. The authors also demonstrate its ability to create safety-critical data.

Strengths

  1. The idea is interesting and significant for the field of autonomous driving. This method generates realistic visual data based on simulator conditions, offering flexibility to data collection in rare and safety-critical scenarios.
  2. Flexible conditioning signals. The authors make great use of 2D visual representations like depth and segmentation as the conditioning signals for sim2real image generation. These 2D representations are more flexible to transfer and easier to collect with well-established visual models. This is especially important for front-view YouTube videos, where 3D conditioning signals, e.g., 3D bounding boxes, are hard to estimate.
  3. The proposed dataset, DIVA, would be helpful for the community. DIVA is a large mixed dataset comprising both real-world and simulated driving data. It leverages the advantages of two different data sources to improve appearance and layout diversity.
  4. The qualitative visualizations of generated images are abundant and intuitive.

Weaknesses

[Major] Unclear usage of the DIVA-Sim dataset in training. Even though it is clear how SimGen works given the layout from the simulator, it is still confusing to me how DIVA-Sim is incorporated during training. Lines 176-177 state that "The training does not contain data rendered from simulators.", showing that DIVA-Sim is not used for the training of CondDiff (Sec. 3.1). So is it only used for the training of ImgDiff? If so, it means that ImgDiff would process both SimCond and RealCond during training to generate simulated and realistic images respectively; would that cause optimization issues such as unstable training, since the two condition and image distributions are significantly different? The authors are strongly encouraged to specify the usage of each part of the dataset in the main paper for readability. Adding a subsection in Sec. 3 might be helpful.

[Major] Lack of scalability analysis for the data augmentation experiments. In Table 5, the authors use an equal-sized set of synthesized data as augmentation to improve the performance of perception models. However, it is important to investigate how performance scales with more synthesized data, such as 200% or even more.

[Minor] Lack of a high-level system workflow for readability and better understanding. The authors are encouraged to add a small yet concise figure distinguishing the training and inference procedures, where each node in the figure represents one stage of the training/inference pipeline, such as the Sim-to-Real Condition Transformation in Sec. 3.1. Different from the architectural design in Figure 3, this figure should focus on the high-level workflow of the system instead of module details. Note that this is not required for the rebuttal, but should be considered for future revisions.

Questions

See weakness 1 about DIVA-Sim dataset.

Limitations

The authors have included a limitation section.

Author Response

W1: Unclear usage of DIVA-Sim dataset in training.

As stated in L158, Tab. 2, and Appendix C.3, DIVA-Sim is only used in the training of ImgDiff. ImgDiff takes RealCond from DIVA and nuScenes and ExtraCond from DIVA-Sim as inputs, without including SimCond, and outputs real images. Consequently, there are no optimization issues. We will add more descriptions for Tab. 2 in the revision.

W2: Lack of scalability analysis for data augmentation experiments.

In Supp Tab. 8, we assess various proportions of synthesized data as an augmentation strategy to boost the perception model (BEVFusion). The results indicate that performance improves with an increase in synthesized data, from 47.7 to 51.9 on AP3D. This validates SimGen's data scaling capability for data augmentation in driving applications.

W3: Lack of high-level system workflow for readability and better understanding.

We will add a figure to illustrate the system workflow in the revision.

Comment

Thank you for your feedback, and I will keep my score as strong accept considering the quality and potential impact of this work.

Official Review (Rating: 5)

The paper presents SimGen, a framework for generating diverse and realistic driving scenes using a cascade diffusion pipeline, aiming to generate controllable, diverse, and realistic autonomous driving images. It also introduces the DIVA dataset, including real-world and simulated data to enhance training diversity for autonomous driving tasks.

Strengths

The problem is interesting and timely and the dataset could be useful for the community. Collecting videos from YouTube and the engineering efforts to create a proper dataset are valuable contributions.

Weaknesses

While the paper tackles an intriguing problem, there are some concerns about it:

  • DIVA Dataset quality: The quality of the provided dataset is unclear. L112 states that "videos with rich annotations are collected." What are the performance metrics for the VLM, depth estimator, and segmentator used? In other words, what is the quality level of the annotated data? Additionally, Tab. 1 is misleading; it should include a column indicating whether the annotations are human-labeled or pseudo-labeled.

  • SimGen performance: The performance of the SimGen framework is not clearly demonstrated. In Tab3, the performance is on par with DrivingDiffusion. If other methods were trained on DIVA, would their performance be the same as SimGen’s? For a more comprehensive comparison, I suggest training SimGen on DIVA only (not on nuScenes) and then evaluating it on nuScenes. The current improvement might be due to having more data (simple augmentation).

  • The utility of the generated scenarios: Tab5 is a nice informative table but it shows limited utility of the generated scenarios in other tasks, especially when the simulator is trained only on nuScenes. Why don't the generated scenarios show higher improvements? Could authors report complete data for Tabs 4, 5, and 6 (for all classes)?

  • Dataset diversity: The diversity of the dataset is unclear. While Tabs 7-10 provide some information, a quantitative comparison with other datasets is needed. The same applies to “Corner cases”—some statistics are necessary beyond the four qualitative figures provided.

  • Safety-critical scenarios: The ability to generate those is one of the main claims of the paper but what is the authors' definition of safety-critical scenarios? How can these be compared with scenarios generated by other methods or datasets? How do they ensure these scenarios are realistic / feasible?

Questions

Following the previous points:

  • Have you tried conventional Sim2Real methods? A more detailed discussion on why CondDiff performs better is helpful.

  • I’m curious how many diffusion steps (t_s) on average were needed to map from synthetic to real (CondDiff)?

  • Why doesn’t SimGen work as well as GenAD for video generation? Any hypotheses?

  • Is Fig. 11 a failure case of the Sim-to-Real transformation? The semantic map of the street at night is very different; the pedestrians disappeared when given the "Mountains" text, …

Minor issues:

The presentation needs improvement. There are ambiguities in the organization of sections and in some sentences, e.g., L257, L159, and L214.

  • L174, L766, L891, …“It’s” and L789 “doesn’t” “can’t”
  • L295 “singe-frame”
  • Fig3 caption: “Eventually, the text features, SimCond, and ExtraCond are inputted into the ImgDiff module” —> RealCond?

Limitations

The authors have mentioned several limitations in their work but their discussion is somewhat general.

While the authors mention some failure cases, there is no discussion on the possible reasons for these failures or how they could be avoided.

Author Response

W1: DIVA Dataset quality.

Supp Tab. 1 includes the evaluation of the annotation quality. To evaluate the VLM, we utilize the DIVA-Real and nuScenes datasets and employ the widely used ROUGE-L metric (84.4 and 85.2) [1] to assess the similarity between the annotated data and pseudo-labels. The pseudo-labels are derived from GPT-4V and manually checked to ensure reasonableness. For the evaluation of the depth estimator and the segmentator on the nuScenes dataset, we use absolute relative error (0.118) and mIoU (82.4), respectively. The results indicate that our annotated data exhibits a sound level of quality. Besides, we will follow your advice and include a column indicating whether the annotations are human-labeled or pseudo-labeled in the revision.
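
For reference, the depth and segmentation numbers quoted above use the standard definitions of absolute relative error and mean IoU (standard metrics, not specific to this paper):

\[
\text{AbsRel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert \hat{d}_i - d_i \rvert}{d_i},
\qquad
\text{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}
\]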

W2: SimGen performance.

For a fair comparison, we have set up the SimGen-nuSc within the nuScenes dataset in Tab. 3. SimGen-nuSc surpasses DrivingDiffusion in terms of image quality and diversity by -0.3 and +0.4 respectively under the same training data. If other methods were trained on DIVA, there might be a performance boost similar to that from SimGen-nuSc to SimGen (+6.1). However, methods like BEVGen and MagicDrive require explicit labels like 3D boxes and lane lines for training, preventing them from utilizing the web data. Moreover, training SimGen on DIVA only and testing on nuScenes would be unfair as it raises domain generalization issues in those works.

W3: The utility of the generated scenarios.

The DIVA dataset is an essential part of our contribution. Aside from the improvements resulting from the framework (demonstrated by SimGen-nuSc), Tab. 5 also shows that SimGen trained on the DIVA dataset brings higher improvements and outperforms baselines by a large margin. In Supp Tab. 2-4, we report complete data for SimGen's performance in 3D object detection tasks. Tab. 3 witnesses the benefits of SimGen on all categories in nuScenes, including motorcycle (+2.9), cone (+1.6), and barrier (+1.7). The results indicate that SimGen can generate a diverse range of objects (10 categories) covering common driving scenarios, thereby confirming its utility in practical applications.

W4: Dataset diversity.

In Supp Tab. 5-6, we provide a comparison between DIVA and nuScenes. nuScenes data is only from Boston and Singapore, while DIVA includes data from South America (8.5%), Europe (16.9%), Asia (14.6%), and Africa (3.1%). In nuScenes, 88.4% of the data is during the daytime, and 80.5% is on normal days. In contrast, DIVA collects data at dawn (16.3%) and dusk (10.1%), covering cloudy (28.6%), snowy (10.2%), and many other weathers (3.1%). Also, 75.3% of nuScenes data is keeping forward and turning, while DIVA includes lane changing (28.1%), intersection passing (6.6%), and U-turns (1.2%), reflecting more complex traffic layouts. The results show that DIVA has a more diverse and balanced data distribution than nuScenes. In addition, Tab. 7 reports statistics on corner cases, including 9 kinds of behaviors of other vehicles that lead to safety-critical driving scenarios. The statistics report that the proportions of crash back and cutting in are 41.2% and 19.0%, respectively.

W5: Safety-critical scenarios.

A safety-critical scenario is a situation where one or more vehicles collide with the ego vehicle, which is rare to collect in real-world datasets like Waymo. We utilize CAT [2] to generate risky behaviors from logged scenarios to ensure reality and feasibility, which uses a data-driven motion prediction model that predicts several modes of possible trajectories of each traffic vehicle. Please refer back to [2] for a detailed description of safety-critical scenarios.

Q1: Detailed discussion on CondDiff.

We have tried the conventional Sim2Real method which focuses on photorealism enhancement [3]. As discussed in Appendix A and C.1, the pixel-to-pixel transformation only brings limited appearance diversity. It is like applying an image filter or style transfer to images rather than generating new content, with the outputs strictly constrained by input conditions. If input conditions derived from a simulator lack background information (e.g., trees, buildings, etc.), the output images may completely ignore generating any background content. Moreover, such a model struggles to alter visual appearances via text prompts in generating novel scenarios.

Q2: Diffusion steps for Sim2Real transformation.

Appendix C.2 shows that the diffusion steps (t_s) for the Sim2Real transformation are set to 0.5.
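
As a rough interpretation (our assumption; Appendix C.2 of the paper is authoritative), a value of 0.5 would correspond to SDEdit-style partial denoising, where the simulated condition is noised to the midpoint of the schedule and then denoised back; the `scheduler` interface below is a placeholder, not a specific library's API.

```python
# Assumption-laden sketch of partial (SDEdit-style) denoising with t_s = 0.5.
# This is our reading of the reply, not code from the paper; `scheduler` is a
# placeholder object exposing add_noise() and a reverse step() method.
import torch

def sim_to_real(cond_diff, sim_cond, text, scheduler, t_s=0.5, total_steps=1000):
    start_t = int(t_s * total_steps)            # start halfway down the noise schedule
    noise = torch.randn_like(sim_cond)
    x = scheduler.add_noise(sim_cond, noise, start_t)
    for t in reversed(range(start_t)):          # denoise only the partial range
        eps = cond_diff(x, t, text)             # predicted noise given text guidance
        x = scheduler.step(eps, t, x)           # one reverse-diffusion update
    return x                                    # realistic condition (RealCond)
```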

Q3: Comparison of SimGen to GenAD.

SimGen aims for controllable image generation in driving scenes, so it is not specifically designed for video generation. We only have a preliminary attempt to achieve temporal consistency. GenAD incorporates spatiotemporal optimizations and is designed for video generation, but it has much less flexibility and controllability in content generation compared to SimGen.

Q4: Explanations on Sim2Real transformation.

Fig. 11 visualizes Sim2Real transformation under varying text prompts rather than a failure case. The real conditions generated by the model may be related to the training data distribution. Despite "street at night" not mentioning trees, the model still envisions the presence of far trees in the scene and generates a corresponding semantic map. Due to the scarcity of pedestrian pixels in the simulated condition, the model may overlook this condition when the text prompt is "Mountains" where there are supposed to be no pedestrians, resulting in their disappearance.

Minor issues.

We will reorganize the sentences and correct the typos in the revision.

[1] ROUGE: A Package for Automatic Evaluation of Summaries.

[2] CAT: Closed-loop Adversarial Training for Safe End-to-End Driving.

[3] Enhancing Photorealism Enhancement.

Author Response

Dear reviewers and ACs,

We sincerely thank all reviewers for their detailed and constructive comments. It is encouraging that reviewers acknowledge our pioneering efforts in establishing a simulator-conditioned generative model in driving scenarios. We have taken each comment into consideration, added more requested ablative studies in the rebuttal, and clarified some implementation details.

The attached PDF includes tables with quantitative results. We will refer the reviewer to the corresponding component in the following detailed responses. Additionally, we will also incorporate these results into our revised paper.

Please refer to the rebuttal blocks below for our point-by-point responses to each reviewer.

Thank you for your time and effort once again! We hope our rebuttal can address your concerns, and you are more than welcome to ask any further questions. Looking forward to your reply!

Best regards,
The authors of Submission 2300.

Final Decision

The submission received uniformly positive feedback from the reviewers, all of whom highlighted the strengths of the paper and its contribution to the field. The authors have addressed some of the raised concerns in the rebuttal phase. We invite the authors to address all the weak points for the camera-ready version. Thanks.