PaperHub

Average rating: 5.3 / 10 — Rejected (4 reviewers; min 4, max 6, std 0.8)
Individual ratings: 6, 5, 4, 6
Confidence: 4.3 · Soundness: 2.8 · Contribution: 2.5 · Presentation: 3.3
NeurIPS 2024

SyntheOcc: Synthesize Geometric-Controlled Street View Images through 3D Semantic MPIs

OpenReview · PDF
Submitted: 2024-04-27 · Updated: 2024-11-06
TL;DR

Geometric controlled generation framework for autonomous driving. High-quality 3D controllable image generation benefits various downstream tasks.

Abstract

Keywords
Autonomous Driving · Image Generation · Data-centric AI · 3D Vision

Reviews and Discussion

Official Review
Rating: 6

The paper introduces SyntheOcc, a framework utilizing diffusion models to synthesize photorealistic images for autonomous driving simulations. The proposed method addresses limitations of existing 2D diffusion models in generating multi-view driving videos by integrating detailed 3D geometric data. The authors effectively employ 3D semantic multi-plane images (MPIs) for precise geometric control, enhancing the realism and utility of generated images for training perception models. The paper also proposes re-weighting strategies to address the imbalance between foreground, background, and object categories. The experiments demonstrate the effectiveness of the proposed MPI encoder and the reweighting strategies.
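
For readers, a minimal, illustrative sketch of the conditioning idea summarized above: a semantic occupancy grid is sliced into camera-aligned semantic multi-plane images, one one-hot semantic map per depth plane. The function names, image resolution, plane spacing, and nearest-plane rasterization are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): slice a semantic occupancy grid
# into camera-aligned 3D semantic multi-plane images (MPIs).
import numpy as np

def occupancy_to_mpi(occ, voxel_origin, voxel_size, K, T_cam_from_world,
                     depth_planes, num_classes, hw=(112, 200)):
    """occ: (X, Y, Z) int array of semantic class IDs, 0 = empty."""
    H, W = hw
    # One one-hot semantic image per depth plane -> (D, num_classes, H, W).
    mpi = np.zeros((len(depth_planes), num_classes, H, W), dtype=np.float32)

    # World-space centers of occupied voxels and their class labels.
    idx = np.argwhere(occ > 0)
    centers = voxel_origin + (idx + 0.5) * voxel_size
    labels = occ[idx[:, 0], idx[:, 1], idx[:, 2]]

    # Transform to the camera frame and project with a pinhole model.
    homo = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)
    cam = (T_cam_from_world @ homo.T).T[:, :3]
    in_front = cam[:, 2] > 0.1
    cam, labels = cam[in_front], labels[in_front]
    uv = (K @ cam.T).T
    u = (uv[:, 0] / uv[:, 2]).astype(int)
    v = (uv[:, 1] / uv[:, 2]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, z, labels = u[valid], v[valid], cam[valid, 2], labels[valid]

    # Rasterize each voxel onto its nearest depth plane.
    plane_id = np.abs(z[:, None] - np.asarray(depth_planes)[None, :]).argmin(axis=1)
    mpi[plane_id, labels, v, u] = 1.0
    return mpi  # stacked along depth/class, this forms the conditioning input

# Toy usage: 16 planes between 1 m and 50 m, a small "car"-like block of class 3.
planes = np.linspace(1.0, 50.0, 16)
occ = np.zeros((200, 200, 16), dtype=np.int64)
occ[98:102, 99:101, 8:12] = 3
K = np.array([[500.0, 0.0, 100.0], [0.0, 500.0, 56.0], [0.0, 0.0, 1.0]])
cond = occupancy_to_mpi(occ, voxel_origin=np.array([-50.0, -50.0, 0.0]),
                        voxel_size=0.5, K=K, T_cam_from_world=np.eye(4),
                        depth_planes=planes, num_classes=18)
print(cond.shape, int(cond.sum()))  # (16, 18, 112, 200), a handful of active cells
```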

Strengths

  • The paper introduces an innovative approach by incorporating 3D semantic Multi-Plane Images (MPIs) to capture both geometric and semantic details of a scene. This approach allows for the precise modeling of 3D environments in a 2D image synthesis context, enhancing the photorealism and depth accuracy of the generated images.
  • The design of the MPI encoder is very effective in handling the input conditions with a large number of channels while maintaining spatial consistency to the latent features of diffusion UNet.
  • Additionally, SyntheOcc incorporates sophisticated reweighting strategies to address class imbalance and ensure focus on critical features. These include foreground enhancement, depth-aware reweighting, and class-balanced reweighting.
  • The paper outlines a comprehensive set of evaluations to demonstrate the effectiveness of the proposed method. Qualitative evaluations visually demonstrate the photorealism and environmental accuracy of the generated images compared to real scenes from the nuScenes dataset. Quantitative analyses leverage metrics such as Frechet Inception Distance (FID) to measure image quality and evaluate perception model performance, offering solid empirical evidence of the framework's effectiveness. Ablation studies further dissect the impact of various components and design choices in the proposed method. Additional robustness tests are conducted to evaluate how changes in the MPI settings (like variations in depth or semantic labeling) affect the output quality and the training effectiveness of perception models.

Weaknesses

The contributions of the reweighting strategies seem to be minor improvements over existing methods (Kai Chen, Enze Xie, Zhe Chen, Lanqing Hong, Zhenguo Li, and Dit-Yan Yeung. Integrating geometric control into text-to-image diffusion models for high-quality detection data generation via text prompt. arXiv preprint arXiv:2306.04607, 2023; Benjin Zhu, Zhengkai Jiang, Xiangxin Zhou, Zeming Li, and Gang Yu. Class-balanced grouping and sampling for point cloud 3D object detection. arXiv preprint arXiv:1908.09492), which limits the perceived novelty of the paper's contributions.

The paper focuses on scene editing capabilities, but there is a noticeable underrepresentation of object-level editing in the experiments.

MagicDrive's data augmentation is evaluated on two perception tasks, BEV segmentation and 3D object detection, with CVT (Zhou & Krahenbuhl, 2022) and BEVFusion (Liu et al., 2023a) as the respective perception models. Hence, evaluations on the same downstream tasks are encouraged for better comparison to the state-of-the-art baseline.

The paper doesn't provide any evaluation comparing against the re-weighting solution proposed by GeoDiffusion.

There are also several noticeable view inconsistencies, e.g., Fig. 14, row 2, columns 2-3 (the clouds differ) and row 7, columns 4-5 (a mismatch in building structures), which are not discussed in the paper.

Questions

I do not have any further questions. I would like the authors to comment on the weaknesses listed above.

Limitations

The authors acknowledge some key limitations of the proposed method. First, it relies heavily on existing data for generating scenes, so it does not create as much variety as it could, which limits how well it can train models to handle different driving conditions. The method also struggles with complex scenes, such as crowds, where it fails to accurately render individual people. This matters for autonomous driving, where accurate representations of the scene are crucial for prediction. The authors suggest that future improvements could include better methods for creating diverse scenes and making the model more capable of handling dynamic environments, which would help make the system more practical and effective for real-world applications.

Author Response

Thank you for the detailed feedback and insightful comments. We appreciate your time and great efforts in reviewing. We now reply to each comment below.

(1) Reweighting strategy

We agree with the reviewer that the reweighting strategy represents a minor level of innovation. As a result, we do not view it as a key contribution or novelty of our paper. Our contribution primarily focuses on precise 3D controllability for image generation. The reweighting strategy serves as an engineering component designed to improve image quality and condition-generation alignment.

(2) Object-level editing

In our paper, we provide object-level editing in Figure 3 by placing new objects and deleting existing objects. Furthermore, from a broad perspective, all images shown in our paper can be viewed as object-level editing starting from an empty canvas.

(3) Align experiment setting in 3D detection using BEVFusion

To provide a comprehensive evaluation, we evaluate our method on 3D detection. In the Gen setting of our paper, our method achieves an mAP of 22.3 and an NDS of 31.3, higher than MagicDrive's 20.8 and 30.2. This indicates that we achieve more effective 3D controllable generation than prior work.

(4) Comparing the reweighting method with GeoDiffusion

We compare our Depth-aware Foreground Reweighting with the reweighting method in GeoDiffusion. Analogous to the setting in Table 3, our method achieves an mIoU of 25.50, which is more effective than GeoDiffusion's 25.07. Our reweighting method is formulated as inverted cosine annealing, while GeoDiffusion formulates it as an exponential function; both apply the weighting to the depth value.
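
For illustration only, a minimal sketch of the two functional forms contrasted above: a depth-dependent foreground weight following an inverted cosine ramp versus an exponential ramp. The normalization range and the maximum weight are assumed values, not the coefficients used in either paper.

```python
# Illustrative comparison of two depth-aware weighting curves; constants are assumptions.
import numpy as np

def inverted_cosine_weight(depth, d_max=50.0, w_max=3.0):
    """Weight grows from 1 (near) to w_max (far) along an inverted cosine ramp."""
    t = np.clip(depth / d_max, 0.0, 1.0)
    return 1.0 + (w_max - 1.0) * (1.0 - np.cos(np.pi * t)) / 2.0

def exponential_weight(depth, d_max=50.0, w_max=3.0):
    """Weight grows exponentially with normalized depth: 1 at 0 m, w_max at d_max."""
    t = np.clip(depth / d_max, 0.0, 1.0)
    return w_max ** t

depths = np.linspace(0.0, 50.0, 6)
print(np.round(inverted_cosine_weight(depths), 2))
print(np.round(exponential_weight(depths), 2))
```

Both curves upweight distant foreground regions in the training loss; only the shape of the ramp differs.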

(5) Noticeable view inconsistencies

In our paper, while contributing to 3D controllable image generation, we acknowledge its limitations.

In most of our cases, our method achieves satisfactory consistency in the intersecting area between views or frames. However, it is important to acknowledge that our method is not perfect. There is still room for improvement in consistency, as certain cases exhibit artifacts, such as Fig. 14, row 2, columns 2-3. We discuss several potential reasons for this problem below.

  1. Reason 1: Missing occupancy data. In the context of autonomous driving applications, taller buildings are irrelevant to planning. Consequently, objects exceeding 5 meters in height are excluded from the occupancy data. This leaves the content above this threshold, such as clouds and buildings, to be generated freely by the diffusion model. In these cases, this randomness potentially contributes to the inconsistencies in appearance.
  2. Reason 2: Imperfect attention. The second reason lies in the imperfect cross-view and cross-frame attention modules. View consistency is compromised by the limited overlapping region between views, resulting in relatively weak supervision. Meanwhile, frame consistency is hindered by the small number of video clips in the nuScenes dataset, posing a challenge to delivering satisfactory results. As the scale of our data and model is limited in the current context, the visual results are inevitably limited. In the future, we plan to harness advanced video foundation models such as (Open-)Sora and its alternatives to improve consistency.

Certain artifacts may not impact downstream tasks. In general, artifacts above the height threshold may not significantly impact downstream applications, as evidenced by our data augmentation experiments. For tasks such as 3D detection or occupancy detection, the upper regions of images are often cropped out, as they exceed the perception boundary, and are not necessary for planning.

Finally, we would like to emphasize that our primary contribution does not lie in frame-consistent or view-consistent generation. Our work primarily focuses on precise 3D controllability for image generation. In our implementation, the frame and view attention modules are analogous to our baseline method MagicDrive. Although we do not show significant improvement in consistency, the performance of our model is sufficient and adequately robust for downstream application, as demonstrated through our data augmentation experiments. We will continue to enhance the multi-view and frame generation in future work.


If there are any further concerns, we would be happy to address them in further discussion. Thank you!

Comment

Thank you for the detailed response and clarifications. Considering the other reviews and the rebuttal of the authors, I am happy to keep my score 6. This paper presents solid work with a fair contribution that should be presented to the community.

Official Review
Rating: 5

This paper introduces SyntheOcc, a method that employs a diffusion model with 3D occupancy as the condition to generate street-view images.

Strengths

  1. Unlike previous methods that use box conditions, this paper proposes the use of 3D occupancy, resulting in finer geometric control ability.
  2. The paper further suggests the use of 3D semantic multi-plane images to represent the 3D occupancy.
  3. The text and figures are well-presented, and the provided examples are very promising.

Weaknesses

  1. The main concern is the inconsistency between views and frames. Despite using Cross-View and Cross-Frame Attention and 3D occupancy as conditions, the spatial and temporal consistency results are unsatisfactory (e.g., Fig 5 (b) and the video demos). This is not the expected outcome, as incorporating 3D occupancy as a consistent world representation should result in better spatial and temporal consistency. Additionally, it would be preferable to have metric results such as FVD for the temporal experiments.
  2. In Table 1, why does SyntheOcc-Aug show worse results for certain categories (e.g., bicycle, motorcycle)?
  3. Table 1 lacks experiments for ControlNet-Aug or ControlNet+depth-aug.
  4. Some discussions regarding 3D occupancy as a 3D geometry condition:
     4.1. The field of view (FOV) of 3D occupancy is limited, as it is generally generated from lidar, which leads to inconsistency issues for high-rise buildings when cameras with larger FOVs are considered.
     4.2. The current annotation of 3D occupancy has limited category coverage. It would be beneficial to explore open-vocabulary approaches.
     4.3. When using only 2D semantic masks as conditions, the paper mentions the presence of ambiguity (i.e., Fig. 6 a0). Can the use of instance-level semantic masks alleviate this problem?

Questions

See Weaknesses. Overall, I have a positive view of the paper and am willing to increase the score if the authors can address my concerns.

Limitations

The authors have provided a comprehensive list of limitations and future work of their paper.

Author Response

Thanks for your positive comments and constructive suggestions. In the following, we reply to individual questions and comments raised by the reviewer:

(1.1) Inconsistency between views and frames.

In our paper, while contributing to 3D controllable image generation, we acknowledge its limitations.

In most of our cases, our method achieves satisfactory consistency in the intersecting area between the views or frames. However, it is important to acknowledge that our method is not perfect, and there is still room for improvement in our consistency, as certain cases have instances of artifacts. We now discuss potential reasons for this problem.

We first note that our occupancy condition provides consistent geometry (shape), but does not provide texture consistency between views or frames. In our video demo in Fig. 11 (row 5, columns 2-3), the car displays a consistent shape but its texture shifts between views. This case is primarily due to insufficient texture consistency, which remains an unresolved open problem. The underlying reason lies in the imperfect cross-view and cross-frame attention modules, whose performance heavily depends on the available training data. View consistency is compromised by the limited overlapping region between different views, resulting in relatively weak supervision. Meanwhile, frame consistency is hindered by the small number of video clips in the nuScenes dataset, posing a challenge to delivering satisfactory results. The scale of our data and model, limited in the current context, is insufficient to make these attention modules perform perfectly in our scenario. In the future, we plan to draw on experience from video generation and harness advanced video foundation models such as (Open-)Sora and its alternatives to improve consistency.

Finally, we would like to emphasize that our primary contribution does not lie in frame-consistent or view-consistent generation. Our work primarily focuses on precise 3D controllability for image generation. In our implementation, the frame and view attention modules are analogous to our baseline method MagicDrive. Although we do not show significant improvement in consistency, the performance of our model is sufficient and adequately robust for downstream application, as demonstrated through our data augmentation experiments. We will continue to enhance the multi-view and frame generation in future work.

(1.2) FVD evaluation

Good idea! The FVD metric serves as an effective measure for assessing video fidelity. As our baseline method MagicDrive does not provide an FVD evaluation, we compare our method with DriveDreamer and DrivingDiffusion. Our FVD score of 251 surpasses DriveDreamer's 340 and DrivingDiffusion's 332, demonstrating better video quality.
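
For reference, FVD is the Fréchet distance between Gaussian statistics of video features, typically extracted with an I3D backbone. Below is a minimal sketch assuming the features have already been extracted; feature extraction itself is not shown.

```python
# Fréchet distance over precomputed video features of shape (num_videos, feature_dim).
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):          # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Toy usage with random "features" just to show the call signature.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(256, 128)), rng.normal(size=(256, 128))))
```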

(2) Results for hard cases

It is a good point. We regard categories such as bicycle and motorcycle as hard cases and corner cases that pose significant challenges for the generative model. Because they have articulated structures and complex topologies, generating images with bicycles or motorcycles sometimes breaks down at our current capacity. Moreover, bicycles and motorcycles belong to the long-tailed classes in the training set, and the image generation network inherently suffers from data imbalance issues. As a result, imperfect data with bicycles and motorcycles hurts performance on these categories, while larger and easier objects such as buses and drivable surfaces tend to yield positive improvements. In the future, we plan to scale our dataset and model size, following scaling laws, to mitigate this issue and obtain better results on bicycles and motorcycles.

(3) Additional experiment

During the rebuttal, we added additional experiments. Due to limited resources, including lab equipment and the very long training time of FB-Occ (4 days x 2), we provide small-scale experiments that train FB-Occ for 30% of the training iterations to reduce training time, while keeping all other settings unchanged. In the data augmentation setting, our method achieves an mIoU of 34.9, higher than ControlNet's 33.2 and ControlNet+depth's 33.5. These experiments indicate that our method is more effective than the ControlNet baselines.

(4) Discussions regarding 3D occupancy as a 3D geometry condition

  1. Defect in occupancy. We agree that the coverage area of the occupancy input is limited. In the context of autonomous driving applications, taller buildings are irrelevant to planning. Consequently, objects exceeding 5 meters in height are excluded from the occupancy data. This leaves the content above this threshold, such as clouds and buildings, to be generated freely by the diffusion model. In these cases, this randomness potentially contributes to the inconsistencies in appearance.
  2. Open-vocabulary image generation. Good suggestion! We will explore open-vocabulary generation to broaden our application domain in the future. In this setting, we will achieve a more fine-grained control than a fixed vocabulary.
  3. Instance-level condition. Adding instance-level conditions would alleviate this problem. One possible method is to incorporate instance masks containing instance IDs such as 0, 1, 2 (see the illustrative sketch below). This approach can be effective in distinguishing overlapping objects, but it is not 3D-aware and cannot capture the underlying 3D shape of objects.
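
A minimal, illustrative sketch of the instance-ID idea mentioned in point 3 above (not the paper's implementation): per-object binary masks are collapsed into a single instance-ID map, with 0 reserved for background.

```python
# Collapse per-object binary masks into one instance-ID map that could be
# concatenated to the semantic condition; purely illustrative.
import numpy as np

def instance_id_map(object_masks):
    """object_masks: list of (H, W) boolean arrays, one per object."""
    h, w = object_masks[0].shape
    ids = np.zeros((h, w), dtype=np.int32)
    for i, mask in enumerate(object_masks, start=1):
        ids[mask] = i          # later objects overwrite earlier ones on overlap
    return ids

m1 = np.zeros((4, 6), dtype=bool); m1[1:3, 1:3] = True
m2 = np.zeros((4, 6), dtype=bool); m2[2:4, 2:5] = True
print(instance_id_map([m1, m2]))
```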

If there are any further concerns, we would be happy to address them in further discussion. Thank you!

Comment

Thank you to the authors for your response, and I would like to keep my rating unchanged.

Official Review
Rating: 4

The paper proposes a new image generation pipeline based on 3D semantic multi-plane images (MPIs), which enables finer geometric control for 3D editing, dataset generation, and long-tailed scene generation. Through extensive experiments, the work demonstrates a substantial advancement in generation quality and better alignment between the condition and the synthesized images.

Strengths

  1. The work explores new 3D semantic Multi-Plane Images (MPIs) as a condition, which provide better spatial alignment compared with baselines and enable 3D editing.

  2. The comparison results are comprehensive and demonstrate the effectiveness of the proposed method; the ablation study is relatively complete in validating the MPI encoder and the reweighting strategy.

  3. The paper is well-written, and the experimental results are presented clearly.

Weaknesses

  1. The MPI encoder, which is the major contribution, is not novel to me. Although the proposed 3D MPI enables finer control than BEVGen, the diffusion model still operates in the 2D domain and generates each view and frame separately without strict geometry constraints.

  2. The reweighting is tricky and hard to tune, considering its many hyperparameters. Are the m and n in Eq. 6 the same for different datasets?

  3. It is hard to judge how well the method works without a supplementary video; I doubt the view consistency of the generated video across frames and views.

Questions

  1. How are the hyperparameters of the “progressive foreground enhancement” determined? Are they calculated from the distribution of the different categories?

  2. Will the appearance of the same objects change across different frames and views? Because the MPI encoder only provides semantic layout, not texture features, and there is no strict geometry constraint.

Limitations

None

Author Response

We greatly appreciate the careful proofreading of our paper. We would like to thank the reviewer for the valuable feedback and for appreciating our clear writing. In the following, we reply to individual questions and comments raised by the reviewer:

(1) W1. Geometry Constraint

In our understanding, the reviewer posits that the absence of explicit geometric constraints precludes the diffusion model from generating strictly consistent images, a perspective we respectfully find to be inaccurate. By implicitly learning geometry-aware image generation from 3D conditions, our diffusion model is trained to produce images that are aligned with the provided conditional input. While the consistency between the condition and the image may not be strictly guaranteed like a 3D representation (mesh or NeRF), the underlying label-image consistency is sufficiently robust. As evidenced by our data augmentation experiments, the generated image is already effective and valuable for downstream tasks.

As for view and frame generation, we additionally learn cross-view attention and cross-frame attention for consistent generation as described in our paper Sec. 3.4. The two attention modules are designed to enable the target view to access information from its neighboring views, specifically from the left and right views, or history frames. This feature flow facilitates the synthesis of coherent and contextually relevant views, thereby enhancing the overall consistency of the generated content.
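
For illustration, a minimal sketch of a cross-view attention block in the spirit described above: the target view's latent tokens query keys and values built from its left and right neighbors. The single-head design, shapes, and residual form are assumptions, not the paper's exact module.

```python
# Illustrative cross-view attention: the target view attends to its neighbor views.
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, target, left, right):
        # target/left/right: (B, N, C) flattened spatial latents of each view.
        q = self.to_q(target)
        kv = self.to_kv(torch.cat([left, right], dim=1))   # neighbors provide context
        k, v = kv.chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return target + self.proj(attn @ v)                # residual update of target view

# Toy usage: 3 views, batch of 2, 64 latent tokens of width 320.
x = torch.randn(3, 2, 64, 320)
block = CrossViewAttention(320)
out = block(x[1], x[0], x[2])
print(out.shape)  # torch.Size([2, 64, 320])
```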

(2) W2, Q1. Hyperparameters

The assumption that more parameters are more difficult to tune may be inaccurate. In practice, we find that the hyperparameters m and n in Eq. 6 do not need to be carefully tuned: n is the number of training iterations and does not need adjustment, and m is the maximum weight coefficient. Empirically, we have tried m=2 and m=3, both of which yield nearly equivalent performance. The hyperparameters are heuristically designed and shared across different datasets and categories.
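
A minimal sketch of a progressive foreground-enhancement schedule consistent with this description: the foreground weight ramps from 1 at the start of training to the maximum coefficient m at iteration n. The cosine ramp shape is an assumption; only the endpoints follow the discussion of Eq. 6.

```python
# Illustrative schedule: weight 1 at step 0, m at step n; the ramp shape is assumed.
import math

def foreground_weight(step, n_iters, m=2.0):
    t = min(max(step / n_iters, 0.0), 1.0)
    return 1.0 + (m - 1.0) * (1.0 - math.cos(math.pi * t)) / 2.0

for step in (0, 10_000, 30_000, 60_000):
    print(step, round(foreground_weight(step, n_iters=60_000, m=2.0), 3))
```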

(3) W3. Video Demonstration

We have provided generated videos in Figures 11, 12, and 16. To clearly demonstrate and compare the consistency across different views and frames, we present the videos frame by frame. It is important to acknowledge that there is still room for improvement in our consistency, particularly as we fine-tune our cross-frame attention on a small, limited dataset. In the future, we plan to harness advanced video foundation models such as (Open-)Sora or its alternatives to improve consistency.

Finally, we would like to emphasize that our primary contribution does not lie in frame-consistent or view-consistent generation. Our work primarily focuses on precise 3D controllability for image generation. In our implementation, the frame and view attention modules are analogous to our baseline method MagicDrive. Although we do not show significant improvement in consistency, the performance of our model is sufficient and adequately robust for downstream application, as demonstrated through our data augmentation experiments. We will continue to enhance the multi-view and frame generation in future work.

(4) Q2. Appearance change

In various figures throughout our paper, such as Figure 5, 7, and 11, the objects and contents maintain consistent identities across different frames and viewpoints. This consistency is learned by cross-frame and cross-view attention within a classifier-free guidance framework. We acknowledge that there are minor artifacts in certain cases due to imperfect attention modules. In our implementation, the two attention modules are analogous to MagicDrive. Although we do not show significant improvement in view consistency, the performance of our model is sufficient and adequately robust for downstream application, as demonstrated through our data augmentation experiments.


If there are any further concerns, we would be happy to address them in further discussion. Thank you!

Comment

We thank the reviewer for the valuable comments. As the deadline for the rebuttal discussion is approaching, we would greatly appreciate it if you could take the time to review our rebuttal at your earliest convenience.

Please let us know if there is any additional information we can provide or any further clarification needed. Thank you very much for your time and consideration. We look forward to your review.

Official Review
Rating: 6

In this paper, the authors propose a new controllable diffusion-based image generation method named SyntheOcc, which takes an occupancy map as input and generates camera images. SyntheOcc enables the application of scene editing and long-tail corner case generation and shows a strong capability of data augmentation for autonomous driving systems.

Strengths

  1. Compared with previous controllable image generation methods for traffic scenarios like Panacea or MagicDrive, the occupancy map contains more 3D spatial information than the BEV layout.
  2. The paper is well-organized and easy to follow.
  3. The extensive experimental results demonstrate the effectiveness of the proposed data generation pipeline.

Weaknesses

  1. The control signal in Panacea or MagicDrive is a BEV layout, which contains only lanes and foreground objects and is more easily acquired than occupancy. SyntheOcc, in contrast, relies on more elaborately collected occupancy data.

Questions

It is interesting to note that diffusion models typically do not generate identical images in repeated runs, even when given the same control inputs. However, in Figure 1, the generated images corresponding to both the original and the edited occupancy maps show remarkably similar street structures. This raises an important question: could this consistency be a result of overfitting to the nuScenes dataset? Diffusion models are known for their variability, yet this particular case shows an unusual level of structural similarity. It is worth considering whether the model has learned the specific patterns of the nuScenes data too closely, leading to this unexpected consistency in output. This observation could have implications for the model's generalizability and its performance on diverse, real-world scenarios outside the training dataset.

Limitations

The proposed SyntheOcc faces challenges in real-world application scenarios. For instance, to generate planning-level long-tail corner cases, other methods like Panacea or MagicDrive simply require editing the object's trajectory. However, SyntheOcc demands not only inputting the background occupancy but also constructing a pseudo occupancy for the foreground object. This raises the question of whether using occupancy as a control signal is an advantage or a disadvantage.

The crux of the issue lies in the complexity of this method compared to alternatives. While other approaches need some adjustments to object trajectories, this technique necessitates providing comprehensive background and foreground occupancy data. This additional overhead prompts us to ponder whether occupancy-based control offers tangible benefits or introduces unnecessary complications.

Author Response

We would like to thank the reviewer for the positive and detailed feedback. Below, we reply to individual comments and questions raised by the reviewer:

(1) W1. The reliance on occupancy introduces a minor burden but with enhanced controllability

We agree with the reviewer regarding our method's reliance on more sophisticated occupancy data. Nevertheless, acquiring such data is not overly challenging: occupancy data can be semi-automatically annotated from a semantic point cloud and 3D bounding boxes (refer to [a]), which is an acceptable burden. On the other hand, conditioning on occupancy enables finer 3D controllability, so that we can edit object shapes and irregular background elements such as drivable areas and terrain. This level of control is challenging to achieve with previous 3D bounding boxes or Bird's Eye View (BEV) layouts.

[a] SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving
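
For illustration, a minimal sketch of the semi-automatic labeling idea mentioned above (in the spirit of SurroundOcc [a]): a semantically labeled point cloud is voxelized into a dense semantic occupancy grid. The grid extents and resolution are assumed, and box-based refinement of dynamic objects is omitted.

```python
# Voxelize a semantically labeled point cloud into an occupancy grid (0 = empty).
import numpy as np

def voxelize_semantic_points(points, labels, origin, voxel_size, grid_shape):
    """points: (N, 3) xyz, labels: (N,) class IDs; returns an (X, Y, Z) grid."""
    occ = np.zeros(grid_shape, dtype=np.int64)
    idx = np.floor((points - origin) / voxel_size).astype(int)
    valid = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx, labels = idx[valid], labels[valid]
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = labels   # last point wins per voxel
    return occ

pts = np.array([[1.2, 0.3, 0.1], [10.0, -4.0, 0.5], [10.1, -4.1, 0.6]])
lab = np.array([11, 4, 4])                          # e.g., drivable surface, car
grid = voxelize_semantic_points(pts, lab, origin=np.array([-50.0, -50.0, -5.0]),
                                voxel_size=0.5, grid_shape=(200, 200, 16))
print(int((grid > 0).sum()))
```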

(2) Q1. Similar structure in Fig.1: Controllability ensures our condition-image consistency

We provide several analyses of this effect:

  1. The similar structure in Fig. 1 is an expected result. The first thing to note is that, if we use the same random seed and the same condition input, the generated images in multiple runs will be identical. As we only modify a portion of the occupancy input by placing traffic cones, the majority of the occupancy data is unchanged. Besides, in our implementation, we use the same random seed for the two figures. Consequently, the generated images display similar street structures. In general, our diffusion model learns 3D controllable generation through a CFG paradigm [b], aligning with the frameworks used in text-to-image generation and ControlNet. Using pairs of conditions and images, our method learns controllable generation in a classifier-free manner.
  2. Empirically, it has been observed that our method does not overfit the nuScenes training set. As demonstrated in Figure 5 of our paper, when the same occupancy input is provided, our method produces a variety of images with consistent attributes and diverse appearances. This evidence indicates that our method does not overfit training images, as it maintains the ability to generate diverse yet aligned outputs.
  3. By employing a larger training dataset, our approach is expected to improve generalizability. We consider the nuScenes dataset, which comprises 200,000 images, to be of substantial scale. Utilizing this dataset for training, our model is more likely to learn the intrinsic relationships between the input conditions and the generated images, thereby mitigating the risk of the unexpected consistency that overfitting would cause.

[b] Classifier-Free Diffusion Guidance

(3) L1. User-friendly editing: disentangle the occupancy into foreground and background

To ease the editing, we provide a strategy that disentangles foreground control from background control in the occupancy data. If we want to edit a car's trajectory, we keep the background occupancy unchanged and select the car's occupancy in the first frame using its 3D box. For the following frames, we remove the original foreground occupancy and simply place the target's occupancy at the desired location along the trajectory. By doing so, we add only minor steps through the use of occupancy while providing more precise 3D control, which makes it a favorable choice for conditioning.
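
A minimal, illustrative sketch of this editing recipe (axis-aligned boxes and pure translations are simplifying assumptions, not the authors' implementation): the background occupancy is frozen, the object's voxels are cropped from the first frame with its 3D box, and the patch is re-stamped at each trajectory waypoint.

```python
# Freeze background occupancy, crop the foreground object once, and re-stamp it
# along a trajectory to produce one edited occupancy grid per frame.
import numpy as np

def edit_trajectory(background_occ, first_frame_occ, box_min, box_max,
                    trajectory, voxel_size):
    """Returns one edited occupancy grid per trajectory waypoint."""
    x0, y0, z0 = box_min
    x1, y1, z1 = box_max
    fg_patch = first_frame_occ[x0:x1, y0:y1, z0:z1].copy()   # object voxels

    frames = []
    for dx, dy in trajectory:                                 # per-frame offset in meters
        occ = background_occ.copy()
        ox = x0 + int(round(dx / voxel_size))
        oy = y0 + int(round(dy / voxel_size))
        sx, sy, _ = fg_patch.shape
        region = occ[ox:ox + sx, oy:oy + sy, z0:z1]
        np.copyto(region, fg_patch, where=fg_patch > 0)       # stamp object, keep background elsewhere
        frames.append(occ)
    return frames

bg = np.zeros((200, 200, 16), dtype=np.int64)
first = bg.copy(); first[100:104, 100:102, 2:6] = 4           # a "car" block, class 4
frames = edit_trajectory(bg, first, (100, 100, 2), (104, 102, 6),
                         trajectory=[(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)], voxel_size=0.5)
print([int((f > 0).sum()) for f in frames])                   # same object count per frame
```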


If there are any further concerns, we would be happy to address them in further discussion. Thank you!

Author Response (to all reviewers)

We appreciate the reviewers' constructive feedback and acknowledge their consensus on the merits of our work. We list the consensus among the reviewers below.

(1) Occupancy as condition: an innovative approach

Our work introduces 3D semantic Multi-Plane Images (MPIs) to capture both geometric and semantic details of a 3D scene using occupancy. Furthermore, we introduce the MPI encoder to effectively extract 3D information for the diffusion model. Our approach allows for the precise modeling of 3D environments in a 2D image synthesis context, enhancing the photorealism and depth accuracy of the generated images.

(2) Precise 3D control: enable various downstream application

Compared to prior work that relies on box-conditioning, our method achieves finer 3D control through the utilization of a semantic voxel. This innovative approach provides precise 3D control, thus broadening the horizon of potential applications. We provide three applications in our paper. First, it facilitates 3D scene editing, enabling users to make nuanced adjustments to 3D environments for image synthesis. Second, the method enables dataset generation, producing high-quality, diverse datasets that are essential for training downstream models. Third, it provides long-tailed scene generation, where it can synthesize rare scenarios, thereby enriching the diversity of training data and improving the generalization capabilities of the perception model.

(3) Thorough experiments validate our effectiveness

Our extensive experimental results demonstrate the effectiveness of the proposed generation pipeline across various settings, including label-image alignment and data augmentation. Notably, the pipeline excels in maintaining consistency between labels and images, ensuring that the synthesized data accurately reflects the intended annotations. This consistency is crucial for training robust models that can generalize well to real-world scenarios, thereby enhancing the reliability and applicability of the generated data in practical applications.

Final Decision

The paper presents a diffusion model for generating images in autonomous driving scenarios using a new occupancy label conditioning, with semantic MPIs to represent the 3D geometric information. All reviewers appreciate the method and quantitative evaluations, but also point out the lack of demonstration of consistency across views and frames that one would expect from a 3D representation. A rebuttal was submitted, which argues that the main contribution of the paper is in generating images conditioned on 3D inputs in the form of occupancy labels, while consistency may be achieved by complementary progresses such as in video generation methods. The rebuttal was acknowledged by uu4q and RWNf in retaining their opinions that the work is useful to the community, while PrKG remains unconvinced that a method for 3D editing has utility without demonstrated consistency. The AC concurs with both the opinions that the paper introduces new ideas, but also lacks in careful positioning and demonstration.

On a careful reading of the paper, there are indeed claims and design choices such as in the abstract and Sec 3.4 for the proposed method to lead to improved consistency across views and frames. The current rebuttal indicates that the limited handling for this is derived from the abilities of MagicDrive, but it is already known to not lead to good quality in terms of consistency. The use of video generation methods to address this, as indicated in the rebuttal, is non-trivial and requires demonstration. The use of 3D control is also partially related to other forms of geometric controls used in the prior works that are noted by the paper. This must be weighed against the novelty of using occupancy labels for geometry control, which does require new methodological steps such as a multi-resolution encoding of semantic MPI information, but the overall framework remains similar to what has been explored in prior works.

Overall, the paper shows interesting initial results, but will benefit from a rewrite to focus more on the abilities gained through a new type of geometry control in images. Claims related to 3D abilities must be calibrated and a full evaluation for consistency across views and frames should be required for 3D representations. A video must be included as supplementary material for a thorough illustration of the generation quality. It is acknowledged that the quantitative results on the effectiveness of proposed data augmentation for downstream tasks is an important contribution and should be retained. The authors are highly encouraged to include these improvements and resubmit to a future venue. These will require a new round of reviews and the paper may not be accepted to NeurIPS at this time.