SyntheOcc: Synthesize Occupancy-Controlled Street View Images through 3D Semantic MPIs
A geometry-controlled generation framework for autonomous driving. High-quality, 3D-controllable video generation benefits various downstream tasks.
Abstract
Reviews and Discussion
This paper presents SyntheOcc, a diffusion-based framework for generating realistic, geometrically controllable street-level images and videos by conditioning on 3D semantic occupancy labels. SyntheOcc leverages 3D semantic multi-plane images (MPIs) to encode detailed geometric and semantic information, enabling precise scene editing, dataset augmentation, and long-tail scene synthesis for autonomous driving applications. Experiments on nuScenes demonstrate its superiority over prior methods in terms of image quality, 3D controllability, and performance on downstream tasks such as occupancy prediction.
Strengths and Weaknesses
Strengths:
(1) SyntheOcc shows great potential in addressing the challenges of long-tail distributions and enriching datasets, providing strong support for downstream practical applications.
(2) This paper establishes a rigorous experimental framework and conducts extensive qualitative and quantitative analysis. Empirical results confirm that the method outperforms existing methods in both real-to-simulated evaluation and simulated-to-real data augmentation. Further ablation studies confirm key design decisions.
(3) The paper has very clear and detailed explanations, intuitive visualizations, and a logical structure that enhances understanding and reproducibility.
Weaknesses:
(1) Limited video generation quality: SyntheOcc's generated results have noticeable artifacts, especially when rendering lane lines on roads (as shown in Figure 11), where lane lines disappear in some images. These inaccuracies may have an adverse impact on subsequent tasks such as motion planning, where accurate scene understanding is critical.
(2) Limited complexity of object editing: The current experiments mainly demonstrate scene editing using simple, regularly shaped objects. To better evaluate the generalization ability of the framework, more complex and irregular objects should be tested in tasks such as relocation or insertion.
(3) Unclear dataset expansion effect: The paper claims that SyntheOcc increases the diversity of the dataset, but Table 9 shows that too much synthetic data can degrade the performance of the model. Therefore, the proposed method may not be very effective for expanding the dataset.
(4) Recent work on image or video generation in autonomous driving, such as UniScene [1], should be included in the comparison.
[1] Li B, Guo J, Liu H, et al. UniScene: Unified occupancy-centric driving scene generation. CVPR, 2025.
Questions
Please address the questions in the weaknesses section.
Limitations
yes
Final Justification
Some of the issues I raised have been addressed, but there are still some unresolved problems (W3). The reviewer cannot agree with the authors' claim. Since the purpose of controllable generation methods is to produce the data "we want," data that fails to function effectively cannot be considered "what we want." Of course, we understand the gap between generated and real data, but the reviewer did not see what efforts the authors have made in this regard. Empirical parameter tuning can hardly serve as an effective and general method. However, the reviewer also understands that such issues are difficult to resolve within the scope of this paper, and is therefore willing to raise the score to 4.
Formatting Concerns
NA
We sincerely thank the reviewer for the thoughtful and constructive questions. Please find our detailed responses below:
W1. Video Generation Quality
We appreciate your thoughtful feedback. This phenomenon is expected given the nature of the scene. The effect observed in certain scenarios—such as the discontinuities in lane line rendering in the upper part of Figure 11—primarily arises from the presence of dashed road markings, which inherently result in intermittent visual patterns when extracting consecutive frames. We kindly refer you to the supplementary videos, which further illustrate the temporal consistency and overall quality of our video generation results.
W2. Complexity of Object Editing
A key contribution of our work is enabling controllable editing for a wide variety of geometric shapes. As shown in Figure 7e, our framework can generate complex and irregular objects, such as excavators, demonstrating the model’s capability to handle challenging object geometries. Since we are unable to add new visual results during the rebuttal period, we will expand our experiments to systematically assess the generalization ability of our method on more complex and irregular object editing tasks in future work.
W3. Data Augmentation
We appreciate your insightful observation regarding Table 9, which indicates that an excessive amount of synthetic data can negatively impact model performance. Our analysis attributes this phenomenon to the distribution gap between synthetic and real-world data; over-reliance on synthetic samples may introduce distributional noise and reduce generalization. To address this, we recommend carefully balancing the ratio of synthetic to real data in practical applications. Additionally, strategies such as data filtering and adaptive sampling can be employed to enhance the quality of the training set. In the revised manuscript, we will provide further experimental analysis on the effects of different data mixing ratios and discuss methods for improving data quality.
It is important to note that the primary focus of our work is on controllable generation, rather than the synthetic data itself. We believe our contributions in this area are substantial. Nevertheless, we recognize the value of synthetic data for downstream tasks and plan to explore its broader applications in future work.
W4. Additional Related Work
We have provided discussions of methods such as UniScene[1] in the related work section. We appreciate the reviewer's comparison with UniScene. However, it is important to note that UniScene operates as an image-to-video (i2v) model that requires an additional initial frame as input (the Zc conditional reference frame in Eq.8 of the UniScene paper). Because of this difference in required inputs, comparing FID scores is neither reasonable nor fair (GT leaking: simply repeating the initial frame n times would yield decent FID scores). As a follow-up work to ours (UniScene was released after our arXiv preprint), the condition design of UniScene actually aligns with our ControlNet+depth experiment in Table 1, which uses semantic+depth for control. Under aligned settings in Table 1, our MPI approach achieved significant improvement over UniScene's semantic+depth approach (25.5 vs 17.5 mIoU).
Below we provide a comparison of mIoU and FID between our method and Semantic + Depth under aligned settings:
| Method | Input Condition | mIoU | FID |
|---|---|---|---|
| Ours (MPI) | MPI | 25.5 | 14.75 |
| ControlNet+depth (UniScene) | Semantic + Depth | 17.5 | 19.12 |
As shown in the table, our MPI-based approach achieves a better mIoU and FID compared to Semantic + Depth, demonstrating the effectiveness of our method in geometry control.
We would also like to point out that UniScene adopts a lossy representation (single plane), which inevitably leads to some information loss during the condition generation process. In contrast, our approach leverages a more faithful and information-preserving representation, resulting in improved controllability and fidelity in the generated scenes.
Thank you again for your valuable suggestions, which will help us further improve and refine our work.
Some of the issues I raised have been addressed, but there are still some unresolved problems (W3). The reviewer cannot agree with the authors' claim. Since the purpose of controllable generation methods is to produce the data "we want," data that fails to function effectively cannot be considered "what we want." Of course, we understand the gap between generated and real data, but the reviewer did not see what efforts the authors have made in this regard. Empirical parameter tuning can hardly serve as an effective and general method. However, the reviewer also understands that such issues are difficult to resolve within the scope of this paper, and is therefore willing to raise the score to 4.
We sincerely appreciate the reviewer’s valuable comments and in-depth discussion.
We fully understand and agree with the reviewer’s concerns regarding the effectiveness of mixing synthetic and real data during downstream task training.
Due to limitations in computational resources and time during the rebuttal, it is difficult for us to conduct a systematic and comprehensive study on this issue. However, we have carried out preliminary explorations and plan to further address this topic in future work. Specifically, we propose the following directions for improvement:
(1) dynamic data sampling, where the ratio of synthetic to real data is adaptively adjusted according to the training stage of the model to mitigate the impact of distributional differences;
(2) bilevel core set selection, which aims to identify and select the most valuable synthetic samples for downstream tasks, thereby improving data utilization efficiency.
We hope these approaches have the potential to further narrow the performance gap between synthetic and real data, and we will conduct more systematic empirical analyses in our future work.
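To make direction (1) above concrete, below is a minimal sketch of what an adaptive mixing schedule could look like in Python. The linear schedule shape, the ratio bounds, and the dataset/parameter names are illustrative assumptions for discussion, not part of our implementation.

```python
import random

def synthetic_ratio(epoch, total_epochs, start=0.5, end=0.1):
    """Hypothetical linear schedule: lean on synthetic data early in training,
    then shift towards real data to reduce the impact of the domain gap."""
    t = epoch / max(total_epochs - 1, 1)
    return start + (end - start) * t

def sample_mixed_batch(real_data, synthetic_data, batch_size, epoch, total_epochs):
    """Draw one training batch whose synthetic fraction follows the schedule."""
    ratio = synthetic_ratio(epoch, total_epochs)
    n_syn = int(round(batch_size * ratio))
    n_real = batch_size - n_syn
    batch = random.sample(synthetic_data, n_syn) + random.sample(real_data, n_real)
    random.shuffle(batch)
    return batch
```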
The paper proposes SyntheOcc, a method that generates street-view images conditioned on a 3D semantic occupancy map. The core idea is to use multiple image planes (MPIs) to encode the 3D semantic scene, which are then used as conditioning for a diffusion model. This adds conditioning on the depth dimension and occluded scenery, allowing for better control over the generated images. The paper also proposes several data re-weighting strategies, including progressive foreground enhancement, depth-aware foreground re-weighting, and class-balanced grouping and sampling. These strategies improve supervision on foreground regions, particularly those at a distance, while better handling long-tailed classes. The method is shown to be effective for data augmentation and generates images with better FID scores than previous methods on the nuScenes dataset.
Strengths and Weaknesses
Strengths:
- The use of multiple image planes (MPIs) to create 3D semantic scene conditioning for a diffusion model is a novel approach. This provides conditioning on both depth and occluded scenery, allowing for greater control over the generated images.
- The paper is well-written and clearly explains its core ideas, with the exception of the related work section.
- The method proves useful for data augmentation, improving mIoU by approximately one point, and generates images with superior FID scores compared to previous methods on the nuScenes dataset.
Weaknesses and Questions
- The main advantage of multiple image planes (MPIs) over standard voxel projection is their ability to encode occluded regions. To clarify this effect, the authors should include an ablation study where only visible voxels are fed to the network (i.e., by masking all occluded pixels in all planes). On line 250, the authors state, "The ControlNet+depth can be regarded as a degradation of SyntheOcc which is reduced to a single plane." I disagree with this comparison, as the differing architectures of SyntheOcc and ControlNet make it difficult to isolate whether the performance gains truly stem from the modeling of occluded regions.
- The related work section is lacking in its coverage of 3D occupancy prediction and diffusion-based image generation. The section on 3D occupancy prediction lists only four seemingly random references. The part on diffusion-based generation discusses only Stable Diffusion (CVPR'22) and ControlNet (CVPR'23), offering limited insight and suggesting an incomplete literature review. Additionally, the introduction (line 24) fails to cite seminal works in camera-based 3D occupancy prediction, such as MonoScene (CVPR'22), which pioneered the approach.
- While SyntheOcc can be used for data augmentation, the resulting improvement of ~1 mIoU is limited. Furthermore, the training process is computationally expensive (requiring 8 A100 GPUs for one day), not to mention the inference time. This makes its practical application for data augmentation questionable.
- Figure 7a is unconvincing because the comparison is made with a ControlNet model that lacks depth information. The ambiguity in the example could likely be resolved by conditioning on depth.
- It is unclear whether text prompts are used during training. If so, the paper should explain how these prompts are generated.
- The "Gen Val" setting in Table 1 seems to create an unfair comparison. The validation set images are generated by SyntheOcc, and all methods, including SyntheOcc itself, are evaluated on this set. This inherently advantages SyntheOcc.
- The "Progressive Foreground Enhancement" and "Depth-aware Foreground Reweighing" modules seem heavily hand-crafted. The paper does not provide clear guidance on how to set the parameters for these modules.
Questions
Questions are included alongside weaknesses to better contextualize.
Limitations
Yes
Final Justification
Most of my concerns, as well as those raised by other reviewers, have been addressed. I am now convinced by the use of MPI for conditioning image/video generation on 3D occupancy, for the following reasons:
- The authors provided experiments demonstrating MPI's advantage in encoding occlusions.
- The "Gen Val" setting has been clearly clarified.
- Parameter settings for both the Progressive Foreground Enhancement and Depth-aware Foreground Reweighting modules are now clearer.
- I somewhat disagree with reviewer H8Lu's assertion that 3D MPIs are not "lossless". I think it still holds within the camera frustum.
- Reviewer vtxp's concerns appear to be resolved and acknowledged by the reviewer.
- I still share reviewer 9Zfe's concern regarding the effectiveness of synthetic data augmentation. Specifically, the claim that SyntheOcc improves dataset diversity is undermined by Table 9, which shows that excessive synthetic data may degrade model performance (Weakness 3).
- The related work section requires substantial revision; the authors have committed to improving it, and I trust the authors.
- While I agree with reviewer H8Lu that a comparison with UniScene would be useful, that work was only recently accepted at CVPR'25, which is after the NeurIPS submission deadline. Given their concurrent timing, the lack of comparison is understandable and should not be grounds for rejection. This paper also provides a valuable alternative approach for 3D conditioning.
Thus, my final recommendation is Accept.
Formatting Concerns
No concern
W1. Regarding the comparison between multiple image planes (MPIs) and standard voxel projection
We greatly appreciate your suggestions. We agree that the main advantage of MPIs lies in their ability to encode occluded regions. Regarding the statement that "ControlNet+depth can be regarded as a degradation of SyntheOcc reduced to a single plane," we acknowledge the architectural differences between the two approaches. We have conducted additional experiments to align the MPI encoder and re-weighting components, with the following results:
| Experiment Name | MPI Encoder | Re-weighting | Input Voxel Region | Notes | mIoU |
|---|---|---|---|---|---|
| SyntheOcc (Full) | ✓ | ✓ | All voxels | Standard method with occluded regions | 25.5 |
| SyntheOcc (single-plane) | ✓ | ✓ | Visible voxels only | Retaining MPI encoder and re-weighting | 19.1 |
| ControlNet+Depth | ✗ | ✗ | Visible voxels only | Original ControlNet | 17.5 |
Through these experimental configurations, we can systematically analyze the contribution of MPIs to occluded region modeling and the impact of architectural differences on performance. We found that both the intermediate representation of MPIs and the network structure of the MPI encoder contribute to improvements in the final results.
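To illustrate what the "visible voxels only" setting above amounts to, here is a minimal sketch under simplifying assumptions: the occupancy has already been resampled into per-camera, frustum-aligned semantic depth planes, and class 0 denotes free space. The single-plane degradation then keeps only the first occupied entry along each ray, discarding everything that our full MPI condition retains behind it. Array names and conventions are hypothetical and not taken from our code.

```python
import numpy as np

def keep_visible_only(mpi_semantics):
    """mpi_semantics: (D, H, W) integer array of per-plane semantic labels,
    ordered near-to-far along each camera ray; 0 denotes free space.
    Returns a copy in which everything behind the first occupied plane of
    each ray is zeroed out, i.e. occluded content is discarded."""
    D, _, _ = mpi_semantics.shape
    occupied = mpi_semantics > 0                    # (D, H, W) boolean occupancy
    first_hit = np.argmax(occupied, axis=0)         # nearest occupied plane per ray
    has_hit = occupied.any(axis=0)                  # rays that hit anything at all
    depth_idx = np.arange(D)[:, None, None]
    keep = (depth_idx == first_hit[None]) & has_hit[None]
    return np.where(keep, mpi_semantics, 0)
```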
W2. Regarding the supplementation of related work
We thank you for your corrections. We will supplement more literature on 3D occupancy prediction and diffusion-based image generation in future versions, including seminal works such as MonoScene (CVPR'22), and add citations to these works in the references section to improve the literature review.
W3. Regarding the practical effectiveness of data augmentation and computational resource consumption
We acknowledge that the current improvement is limited and training costs are high. We believe that SyntheOcc's main contribution lies in proposing a new controllable generation paradigm, with data augmentation being only one of its potential applications. In the future, we will explore more efficient training and inference schemes to enhance its practicality.
W4. Regarding the comparison in Figure 7a
We found that simply adding depth information cannot fundamentally solve this problem. A possible reason is that the occupancy resolution is relatively coarse and cannot precisely reflect depth relationships, while MPIs can fully preserve the length and width information of objects, thereby maintaining correct object poses in the generated results. Since we cannot provide new visualizations during the rebuttal, we will include this comparison in future versions.
W5. Regarding text prompts during training
For video generation, we used a VLM for annotation, specifically employing the CogVideoX annotation tool cogvlm2-llama3-caption for video labeling. We will supplement this explanation in the revised manuscript.
W6. Regarding the fairness of the "Gen Val" setting in Table 1
We thank you for your attention. In fact, we used different models to generate different validation sets. For example, we used MagicDrive and ControlNet to generate corresponding validation sets, and then evaluated them with FB-OCC. Under these circumstances, the comparison is fair.
W7. Regarding the parameter settings for the "Progressive Foreground Enhancement" and "Depth-aware Foreground Reweighing" modules
We acknowledge that these modules have some hyperparameters, but we found that making small adjustments to the hyperparameters within a certain range has minimal impact on the results.
To further validate the impact of parameter changes on model performance, we conducted ablation experiments on the weight parameter m used in the two modules. Due to computational resource constraints, we experimented with a limited set of values. The table below shows the mIoU performance of the model on the validation set under three different values of m in Eq. 8:
| m value | 2.0 | 2.5 | 3.0 |
|---|---|---|---|
| mIoU | 25.56 | 25.54 | 25.53 |
As m increases, we observe that the mIoU experiences a slight decrease; however, the overall performance of the modules remains relatively stable across different m values. This indicates that our method is robust to changes in this hyperparameter, and small variations in m do not significantly affect the final results.
We thank you again for your valuable suggestions, and we will supplement and improve the above issues in future versions. We welcome further discussion!
Thanks to the authors for their detailed response and the considerable effort invested in conducting these experiments, especially within such a short timeframe.
Does the MPI's advantage lie in the ability to encode occlusion?
I am convinced by the provided results, which confirm the benefits of MPI for modeling occluded scenery, given the significant performance degradation when retaining only visible voxels in the MPI. Thank you for providing experiments that isolate the architectural effects. I am convinced by the results. However, the authors should still modify the incorrect statement: "ControlNet+depth can be regarded as a degradation of SyntheOcc reduced to a single plane."
Poor related works section
I trust that the authors will significantly improve the related work section to better connect and contextualize their work within the existing literature. Not directly relevant, but the paper reminds me of SceneRF (ICCV'23), where they adapt "MINE: Towards Continuous Depth MPI with NeRF for Novel View Synthesis," -- which is a MPI-based method -- for self-supervised occupancy prediction. This might be worth mentioning as it provides additional context for MPI applications in 3D occupancy prediction tasks.
Regarding the practical effectiveness of data augmentation and computational resource consumption
I agree with the authors that "SyntheOcc's main contribution lies in proposing a new controllable generation paradigm, with data augmentation being only one of its potential applications." However, this might be worth mentioning in the limitations section.
Figure 7a is unconvincing due to comparison with ControlNet lacking depth information
The authors stated that "adding depth information cannot fundamentally solve this problem. The possible reason is that the occupancy resolution is relatively small and cannot precisely reflect depth relationships." This explanation is not convincing, as one could use raw depth without discretizing into occupancy. Given that the authors have a "ControlNet+depth" baseline in Table 1, it remains unclear why this wasn't used directly for Figure 7a instead of only "ControlNet."
The "Gen Val" setting in Table 1 creates potentially unfair comparisons
The authors clarify: "we used different models to generate different validation sets. For example, we used MagicDrive and ControlNet to generate corresponding validation sets, and then evaluated them with FB-OCC. Under these circumstances, the comparison is fair." Please add this crucial information to the experiments section, as it is currently missing from the paper.
Regarding parameter settings for "Progressive Foreground Enhancement" and "Depth-aware Foreground Reweighing" modules
When varying hyperparameters, the mIoU changes very slightly (25.56 → 25.54 → 25.53). Does this indicate that these modules have minimal impact? If hyperparameters have negligible effects, what justifies introducing these components?
Additional concerns from other reviewers
I find several concerns raised by other reviewers quite compelling:
- The comparison between 3D MPIs vs. full 3D occupancy volume conditioning (e.g., UniScene) raised by reviewers H8Lu and 9Zfe. While comparison with UniScene is not mandatory given its very recent publication at CVPR'25, the discussion should still be addressed.
- While SyntheOcc increases dataset diversity, Table 9 demonstrates that excessive synthetic data can degrade model performance (reviewer 9Zfe).
I would like to see the discussion from other reviewers before making my final decision. Overall, I remain positive about this paper.
Reviewer Comments and Our Responses
We sincerely thank the reviewer for their constructive feedback and thorough evaluation of our work. We will address each concern systematically and incorporate these suggestions to improve our manuscript.
1. Regarding MPI's advantage in encoding occlusion
We appreciate the reviewer's acknowledgment of our experimental results demonstrating the benefits of MPI for modeling occluded scenery. We will modify the statement regarding ControlNet+depth as suggested by the reviewer.
2. Related works section
We acknowledge the reviewer's concern about the related works section. We will significantly expand and improve this section to better contextualize our work within the existing literature, providing more comprehensive coverage of relevant prior work.
3. Practical effectiveness and computational considerations
We agree with the reviewer's suggestion to include a discussion of limitations regarding data augmentation and computational resource consumption. We will add a dedicated limitations section to address these practical considerations.
4. Figure 7a comparison methodology
There is an objective limitation for raw depth: we cannot obtain sharp depth information directly from occupancy representations. Raw depth without discretization into occupancy is extremely difficult to acquire, as dense depth can only be obtained through graphics engines or depth estimators, not through occupancy grids. Depth derived from occupancy is inherently coarse. Therefore, from a fair comparison perspective, when using only occupancy representations, MPI representation outperforms depth-based approaches.
5. "Gen Val" setting clarification
We will add a detailed explanation of the "Gen Val" setting to the experiments section. As clarified in our response, we used different models to generate corresponding validation sets (e.g., MagicDrive and ControlNet), then evaluated them with FB-OCC. This methodology ensures fair comparison across different approaches.
6. Parameter settings for Progressive Foreground Enhancement and Depth-aware Foreground Reweighing modules
We have extended our analysis by including the case of m=1 in the table, which serves as the baseline configuration without the Progressive Foreground Enhancement and Depth-aware Foreground Reweighting modules (as reported in the original Table 3). Our experimental results indicate that increasing m from 1 to 2 leads to a substantial improvement in performance, demonstrating the effectiveness of these modules. However, further increasing m beyond 2 does not yield additional gains, and the performance tends to saturate, with the optimal value observed around m=2. Due to computational and time constraints, we did not explore values of m between 1 and 2. We hypothesize that as m increases from 1, the mIoU improves correspondingly, but when m exceeds 2, the performance gain becomes marginal and approaches saturation.
| m value | 1.0 | 2.0 | 2.5 | 3.0 |
|---|---|---|---|---|
| mIoU | 23.05 | 25.56 | 25.54 | 25.53 |
7. Comparison with UniScene
We note that UniScene employs semantic+depth conditioning rather than full 3D occupancy volume conditioning. UniScene adopts a single-plane representation, which introduces information loss during the condition generation process. In contrast, our multi-plane approach may preserve geometric information more effectively, contributing to enhanced controllability and fidelity in the generated scenes.
Summary
We appreciate the reviewer's positive overall assessment and will incorporate all suggested improvements to enhance the quality and clarity of our manuscript. We believe these revisions will significantly strengthen our contribution to the field. We welcome any further discussion!
Thank you for your detailed response.
I agree with all points except the statement: "we cannot obtain sharp depth information directly from occupancy representations. Raw depth without discretization into occupancy is extremely difficult to acquire, as dense depth can only be obtained through graphics engines or depth estimators, not through occupancy grids. Depth derived from occupancy is inherently coarse. Therefore, from a fair comparison perspective, when using only occupancy representations, MPI representation outperforms depth-based approaches." As you only need depth as input to Stable Diffusion, using depth from existing methods like DepthAnythingV2 is already enough.
Anyway, this point is not very critical in my opinion.
However, I agree with reviewer 9Zfe about weakness 3: "Unclear dataset expansion effect: The paper claims that SyntheOcc increases the diversity of the dataset, but Table 9 shows that too much synthetic data can degrade the performance of the model."
Most of my concerns, as well as those raised by other reviewers, have been addressed. I am now convinced by the use of MPI for conditioning image/video generation on 3D occupancy, for the following reasons:
- The authors provided experiments demonstrating MPI's advantage in encoding occlusions.
- The "Gen Val" setting has been clearly clarified.
- Parameter settings for both the Progressive Foreground Enhancement and Depth-aware Foreground Reweighting modules are now clearer.
- I somewhat disagree with reviewer H8Lu's assertion that 3D MPIs are not "lossless". I think it still holds within the camera frustum.
- Reviewer vtxp's concerns appear to be resolved and acknowledged by the reviewer.
- I still share reviewer 9Zfe's concern regarding the effectiveness of synthetic data augmentation. Specifically, the claim that SyntheOcc improves dataset diversity is undermined by Table 9, which shows that excessive synthetic data may degrade model performance (Weakness 3).
- The related work section requires substantial revision; the authors have committed to improving it, and I trust the authors.
- While I agree with reviewer H8Lu that a comparison with UniScene would be useful, that work was only recently accepted at CVPR'25, which is after the NeurIPS submission deadline. Given their concurrent timing, the lack of comparison is understandable and should not be grounds for rejection. This paper also provides a valuable alternative approach for 3D conditioning.
Thus, my final recommendation is Accept.
We are pleased that our rebuttal has addressed the reviewer’s major concerns. We would like to express our sincere gratitude for the reviewer’s valuable comments and insightful discussion.
Regarding the depth issue, our perspective is that methods such as Depth Anything require an RGB image as input to predict depth. However, in scenarios where we aim to generate scenes that do not exist in the real world, such an RGB image is unavailable, making depth prediction infeasible. In contrast, our approach allows for generation based on manually edited occupancy or occupancy generated by generative models, thus enabling controllable scene synthesis even in the absence of real RGB images and depth.
With respect to synthetic data augmentation, we acknowledge that this is a challenging task and our current work represents a preliminary exploration. We propose several potential solutions for future research: (1) dynamic data sampling, in which the ratio of synthetic to real data is adaptively adjusted according to the training stage to mitigate the impact of distributional differences; and (2) bilevel core set selection, which aims to identify and select the most valuable synthetic samples for downstream tasks, thereby improving data utilization efficiency. We hope these approaches have the potential to further narrow the performance gap between synthetic and real data, and we will conduct more systematic empirical analyses in our future work.
Finally, it is important to emphasize that the primary focus of our work is on controllable generation, rather than synthetic data and augmentation itself. We believe our contributions in this area are substantial. Nevertheless, we recognize the value of synthetic data for downstream tasks and plan to explore its broader applications in future work.
We sincerely thank the reviewers for their valuable feedback and constructive suggestions. We will continue to improve our method and further explore these important issues in future work.
Thank you for your responses. You're right. I overlooked the fact that depth prediction isn’t feasible without an RGB image. My apologies.
I also agree with your point about synthetic data generation. It highlights the limitations of existing methods and suggests a promising direction for future research.
This paper introduces SyntheOcc, a novel generative framework for synthesizing realistic, controllable street-view images and videos conditioned on 3D semantic occupancy grids. The key innovation is using 3D Semantic Multi-Plane Images (MPIs) as conditioning input for a diffusion model, effectively encoding detailed 3D geometry and semantics while maintaining pixel-aligned relationships. The framework includes a specialized MPI encoder and re-weighting strategies to improve foreground object generation in imbalanced driving scenes. SyntheOcc demonstrates effective data augmentation capabilities for improving downstream perception tasks.
Strengths and Weaknesses
Strengths:
- Utilizing multiplane images (MPIs) as representation provides a novel method to encode 3D semantic occupancy.
- Addresses critical data scarcity problem for dense 3D perception using 3D semantic MPIs, bridging 3D occupancy and 2D image synthesis with geometric soundness and occlusion handling.
- Through manipulating voxels, SyntheOcc is capable of creating diverse novel scenes, which is vital for downstream tasks such as scene editing and data augmentation.
Weaknesses:
- While 3D semantic occupancy provides good geometric control, compared to using HD maps and bounding boxes as input conditions, the control over traffic element types and scene editing convenience remains limited. For example, control over background elements such as crossings, as well as lane width and number of lanes, is still limited.
- It would be better to provide more test cases, such as rotating the ego vehicle, translating it to other lanes, and cut-in scenarios, to validate the spatial generation consistency of the proposed method and further demonstrate its potential for practical applications.
- The current FVD comparison lacks corresponding evaluation settings and detailed assessment protocols, making it difficult to fully assess the quality of generated videos.
Questions
Please address the concerns raised in the weaknesses section.
Limitations
yes
Final Justification
The authors have addressed my concerns. Given that I initially assigned a positive rating and considering the other reviewers’ comments, I am inclined to maintain my original score.
Formatting Concerns
NA
We sincerely thank the reviewer for the detailed evaluation and valuable suggestions. We address each of the raised concerns as follows:
W1. Control property of 3D semantic occupancy
Our method mainly relies on the semantic occupancy annotations provided by the nuScenes dataset, which currently do not include finer-grained traffic element information. If more detailed annotations become available in the future, our method can seamlessly support more precise scene control.
Our objective and contribution is not to achieve lane-level control, but rather to propose a lossless intermediate representation that extracts geometric and semantic features from occupancy. Currently, the semantic occupancy based on nuScenes annotations does not include lane-level annotations. However, if such annotations were available, our methodology would be fully capable of performing lane-level controllable generation. This property is not a disadvantage of our methodology, but rather a characteristic of the dataset itself.
W2. More test cases
Thank you for your suggestion. As we are unable to add new experiments during the rebuttal period, we kindly ask you to refer to our supplementary materials. In video1.mp4, the two videos in the upper right corner (e.g., row 2, column 3) demonstrate ego vehicle turning and rotation scenarios, where our model successfully generates plausible videos. As we cannot add new figures during the rebuttal, we will include additional spatial consistency validation cases, such as lane changing and cut-in scenarios, in future versions to further demonstrate the practical potential of our method.
W3. FVD comparison
We used the same FVD evaluation code as MagicDrive to compute FVD metrics between the generated videos and the ground truth videos. The specific implementation can be found in the open-source repository of MagicDrive. We have also provided our generated videos in the supplementary materials for reference, so the generation quality can be intuitively observed. In future versions of the paper, we will include more detailed descriptions of the evaluation process and protocols to further improve the transparency and reproducibility of our evaluation.
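For readers unfamiliar with the metric, FVD is the Fréchet distance between Gaussian fits of feature statistics from real and generated videos; a minimal sketch is shown below. It assumes the video features have already been extracted with an I3D-style backbone into two (N, C) arrays; the exact backbone and preprocessing are what evaluation codebases such as MagicDrive's pin down.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """feats_*: (N, C) arrays of video features from the same backbone.
    Returns ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s_r = np.cov(feats_real, rowvar=False)
    s_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(s_r @ s_g)
    if np.iscomplexobj(covmean):  # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(s_r + s_g - 2.0 * covmean))
```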
Once again, we thank the reviewer for the constructive feedback, which will greatly help us improve and refine our work. We welcome further suggestions and discussion.
Thanks for the responses. It resolves my concerns. I will keep my rating.
The paper presented SyntheOcc, an image and video generation method conditioned on 3D occupancy, to generate photorealistic images. A key component of the proposed method is to represent 3D occupancy as a stack of Multi-Plane Images (MPIs), preserving geometric and semantic information. Certain reweighting strategies during training time are also proposed to address the class imbalance problem.
Strengths and Weaknesses
[Strengths]
- Representing 3D occupancy as MPIs is a good attempt to preserve geometry and semantic information, which is important for image and video generation. Such representation also has the potential to lead to better efficiency.
- Compared to previous methods, the video generation results in terms of FID/FVD show modest improvement.
[Weaknesses]
- I don't think the claim that 3D MPIs are a lossless representation of 3D scenes is correct. For one, as described in Eq. 1-2 in Section 3.2, entities outside the defined range by definition cannot be captured by such MPIs, leading to information loss. For another, MPIs are limited by viewpoints, which means handling all occlusions is impossible. Hence the claim of a "lossless representation" is misleading.
- It is not clear to me how much benefit using 3D MPIs brings. For example, UniScene [1], which the authors also surveyed in Section 2.3, directly conditions on the full 3D occupancy volume for image and video generation, and demonstrated an FID score of 6.12 (Table 4 in the UniScene paper) on nuScenes, far outperforming the proposed method's 14.75.
- Inference runtime analysis is missing. It would be very beneficial to include it, since that is a critical aspect of evaluating image/video generation approaches, especially in terms of practicality.
[1] Li, Bohan, et al. "Uniscene: Unified occupancy-centric driving scene generation." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
Questions
- How is Eq. 8 applied to reweight the loss incurred from foreground objects in Eq. 9? Is it done through masking (using the GT)?
- For the "Gen" validation set in Table. 1, is it generated by the proposed method?
Limitations
Yes
Final Justification
Authors' rebuttal addressed most of my concerns. Hence I improve my rating to "borderline accept".
Formatting Concerns
No concern regarding paper formatting.
We sincerely thank the reviewer for the thoughtful and constructive questions. Please find our detailed responses below:
W1. On the “lossless representation” claim
We appreciate the reviewer's careful consideration of our "lossless representation" claim. We would like to clarify that our approach maintains the same resolution as the input occupancy grid, where dmin and dmax correspond directly to the occupancy voxel boundaries. Furthermore, our use of surround-view cameras enables 360-degree coverage, ensuring comprehensive capture of all occupancy within the camera frustum. Regions outside the camera field of view are naturally excluded from generation requirements. We acknowledge that our original terminology may have been imprecise and will revise our manuscript to more accurately reflect this perspective.
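As a small illustration of that correspondence, and assuming uniform plane spacing purely for the sake of the example, plane centers can be placed so that plane boundaries coincide with the voxel boundaries along the viewing axis:

```python
import numpy as np

def plane_depths(d_min, d_max, num_planes):
    """Depths of MPI plane centers under uniform spacing between d_min and d_max,
    so that the plane boundaries line up with the occupancy voxel boundaries."""
    edges = np.linspace(d_min, d_max, num_planes + 1)  # num_planes + 1 boundaries
    return 0.5 * (edges[:-1] + edges[1:])              # centers between boundaries
```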
W2. Benefit of using 3D MPIs
We appreciate the reviewer's comparison with UniScene. However, it is important to note that UniScene operates as an image-to-video (i2v) model that requires an additional initial frame as input (the Zc conditional reference frame in Eq.8 of the UniScene paper). Because of this difference in required inputs, comparing FID scores is neither reasonable nor fair (GT leaking: simply repeating the initial frame n times would yield decent FID scores). As a follow-up work to ours (UniScene was released after our arXiv preprint), the condition design of UniScene actually aligns with our ControlNet+depth experiment in Table 1, which uses semantic+depth for control. Under aligned settings in Table 1, our MPI approach achieved significant improvement over UniScene's semantic+depth approach (25.5 vs 17.5 mIoU).
Below we provide a comparison of mIoU and FID between our method and Semantic + Depth under aligned settings:
| Method | Input Condition | mIoU | FID |
|---|---|---|---|
| Ours (MPI) | MPI | 25.5 | 14.75 |
| ControlNet+depth (UniScene) | Semantic + Depth | 17.5 | 19.12 |
As shown in the table, our MPI-based approach achieves a better mIoU and FID compared to Semantic + Depth, demonstrating the effectiveness of our method in geometry control.
We would also like to point out that UniScene adopts a lossy representation (single plane), which inevitably leads to some information loss during the condition generation process. In contrast, our approach leverages a more faithful and information-preserving representation, resulting in improved controllability and fidelity in the generated scenes.
W3. On inference runtime analysis
We appreciate the reviewer’s suggestion regarding inference efficiency. Our experiments indicate that the average image generation time is 1 second per frame, while video generation (49 frames) requires approximately 50 seconds. These results were obtained on an NVIDIA A100 GPU.
| Task Type | Average Inference Time (per frame) | Average Inference Time (per video, 49 frames) |
|---|---|---|
| Image Generation | 1 second | - |
| Video Generation | - | 50 seconds |
We will incorporate these experimental results into the revised manuscript to provide a more comprehensive assessment of the practical applicability of our method.
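For reference, the numbers above were obtained by wall-clocking the generation call; a sketch of such a measurement protocol is given below. The `pipeline` callable and its signature are placeholders for whichever diffusion pipeline is being profiled, not an API of our released code.

```python
import time
import torch

def average_latency(pipeline, condition, n_warmup=2, n_runs=10):
    """Average per-call latency (seconds) of a generation pipeline on GPU.
    `pipeline(condition)` is a hypothetical callable returning generated frames."""
    for _ in range(n_warmup):          # warm-up runs to exclude compilation/caching
        pipeline(condition)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        pipeline(condition)
    torch.cuda.synchronize()           # make sure all queued GPU work has finished
    return (time.perf_counter() - start) / n_runs
```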
Q1. Application of Eq. 8 for reweighting loss in Eq. 9
A1. Yes, the reweighting is implemented via masking, utilizing the ground truth (GT) to guide the denoising objective.
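To make this concrete, below is a minimal sketch of GT-driven masking applied to the per-pixel diffusion loss, in the spirit of Eq. 8-9: foreground pixels (identified from the ground-truth occupancy projection) receive a larger weight m, optionally scaled with depth. Tensor names, the depth scaling, and the default m are illustrative assumptions rather than the exact formulation in the paper.

```python
import torch
import torch.nn.functional as F

def reweighted_diffusion_loss(noise_pred, noise_gt, fg_mask, depth=None, m=2.0):
    """noise_pred, noise_gt: (B, C, H, W) predicted / target noise.
    fg_mask: (B, 1, H, W) binary foreground mask rendered from GT occupancy.
    depth:   optional (B, 1, H, W) per-pixel depth to upweight distant foreground."""
    fg_weight = m * torch.ones_like(fg_mask)
    if depth is not None:
        # grow the foreground weight with normalized depth so distant objects count more
        fg_weight = fg_weight * (1.0 + depth / depth.max().clamp(min=1e-6))
    weight = torch.where(fg_mask > 0, fg_weight, torch.ones_like(fg_mask))
    per_pixel = F.mse_loss(noise_pred, noise_gt, reduction="none")  # (B, C, H, W)
    return (weight * per_pixel).mean()
```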
Q2. “Gen” validation set in Table 1
A2. Yes, the “Gen” validation set refers to samples generated by our proposed method. We evaluate the alignment between the input occupancy condition and the generated images, and further assess perception accuracy on this generated validation set using a pretrained detector.
We thank you again for your valuable suggestions, and we will supplement and improve the above issues in future versions. We welcome further discussion!
Thank you for the responses. While I'm still not entirely convinced regarding the comparison with UniScene, given it's a concurrent work this issue is not critical. Most of my concerns have been addressed hence I'm willing to raise my rating.
Thank you very much for your valuable feedback and for taking the time to review our manuscript. We are pleased to have carefully addressed the concerns and questions you raised in your review.
As you mentioned an update to your overall rating, we would like to gently remind you that you have the option to adjust your review rating accordingly (via the edit button). We noticed that the system still shows the original rating of 3 at this time, and it seems no change to the rating has been made just yet.
Thank you again for your contribution to improving our work.
The paper proposes to encode occupancy maps for synthesizing images of driving scenarios. To that end, it proposes to encode the occupancy maps as MPIs and designs an encoder of MPIs for diffusion models. It further proposes several re-weighting strategies to improve the encoding.
The reviewers acknowledged the MPI representation as suitable for encoding occupancy, modest improvements in video generation results, a mostly clear explanation (except for the related work), and usefulness for data augmentation.
Several concerns were raised, with the most severe issues being the lack of a demonstrated benefit of MPIs over 3D occupancy or single-plane representations, occupancy maps providing an inconvenient editing experience, a missing demonstration of consistency across spatial edits, the limited quality of generated videos, missing related work, and missing guidance on tuning the heuristic re-weighting strategies.
Most of these concerns were rebutted and discussed with reviewers, and reviewers subsequently scored 3 BA and 1 A. The AC read the paper, rebuttal, and subsequent discussion, but does not follow the reviewers' recommendation.
In the opinion of the AC, there are multiple factors that keep the paper from passing the bar of acceptance:
- The choice of occupancy maps seems highly impractical for actual editing tasks, as pointed out by reviewers. The occupancy maps as presented in the paper appear hard to manipulate, especially at scale. A comparison to prior work that uses HD maps and bounding boxes could have addressed these concerns, but such a comparison to those approaches was only made verbally. The fact that occupancy maps in this paper could be derived directly from the nuScenes ground truth may be convenient for data processing, but does not provide a clear path towards easy scaling of the synthesis or editing of these maps beyond recycling the ground truth.
- The use for data augmentation is severely limited by the computational resources and time required for inference (and training). Reported results show only minor improvements as noted by the reviewers.
- The related work section is insufficient for a paper submitted to NeurIPS. As reviewers pointed out, key prior work is missing. In contrast to the reviewers, the AC is not persuaded by the promises made in the rebuttal. A more detailed plan for revision would have been necessary.
- The comparison to other representations for encoding occupancy is lacking. There is no comparison to e.g. simply representing an occupancy map as a multi-channel image (eventual memory constraints could be addressed through the resolution in the channel dimension). A comparison to depth maps in a ControlNet setting was made in the paper, but as the reviewers pointed out, this comparison confounds model architecture and representation. The comparison to a projection of the voxels into a single image is provided in the rebuttal, but here it unfortunately remains unclear whether the MPI encoder has been retrained and how the re-weighting strategies tuned for the MPI encoder affect the comparison.