Overall rating: 5.0 / 10 (Poster) · 4 reviewers
Individual ratings: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 3.5 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 3.0
NeurIPS 2024

Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly

Submitted: 2024-05-10 · Updated: 2024-11-06
TL;DR

Assembling diverse deep priors from large models for scene generation from single images in a zero-shot manner.

Abstract

Keywords
Scene Reconstruction from Single Images · Zero-Shot Reconstruction · Deep Prior Assembly

Reviews and Discussion

Review (Rating: 5)

This paper introduces a framework called "Deep Prior Assembly," which combines various deep priors from large models to reconstruct scenes from single images in a zero-shot manner. By breaking down the task into multiple sub-tasks and assigning an expert large model to handle each one, the method demonstrates the ability to reconstruct diverse objects and plausible layouts in open-world scenes without additional data-driven training.

Strengths

The methodology effectively integrates various deep priors from different large models, enhancing the robustness of the reconstruction task, and does not rely on 3D or 2D data-driven training, reducing dependency on specific datasets. It introduces new optimization methods for pose, scale, and occlusion parsing, improving the collaboration among deep priors.

Weaknesses

  1. One obvious drawback of this methodology is the huge memory cost caused by assembling many pre-trained modules if we aim to apply this method on mobile edge devices. Can the authors think of a way to mitigate this problem?
  2. The reconstruction quality of this framework relies mainly on the capacity of the single-image reconstruction model, e.g., Shap-E in this paper. However, in practical experience, Shap-E is very sensitive to the scale of the input image. Is there a universal appropriate scale that fits all instances, given that the paper sets the scale to 6?
  3. The methodology fits the 3D proposals into the scene by aligning them with the depth estimated by Omnidata. According to the ablation, 3D matching plays the most important role in the whole framework compared to SD and 2D matching. So how is the quality of the estimated depth ensured?
  4. I am curious how the methodology would perform on outdoor scenes.

Questions

Please refer to the weaknesses part.

Limitations

Please refer to the weaknesses part.

Author Response

We deeply appreciate Reviewer veoH's thoughtful feedback and the time invested in evaluating our work. We respond to each question below.

Q1: Applying DeepPriorAssembly on mobile edge devices.

We agree that DeepPriorAssembly is currently not capable of running directly on mobile edge devices. In fact, most large models (e.g., ChatGPT, StableDiffusion, LLaVA) face the same challenge when it comes to inference on mobile edge devices. A more practical approach would be for users to submit captured images to remote servers for processing, after which the reconstructed 3D scenes are sent back to the users.

Q2: The instance scale for Shap·E.

We reached the same experimental conclusion regarding Shap·E's reconstructions: Shap·E is indeed sensitive to the scale of the input images. We conducted ablation studies to explore the effect of instance scale on the generation quality of Shap·E, as shown in Sec. C.2 and Fig. 10 of the Appendix. The visual comparison of generations with different instance scales in Fig. 10 shows that Shap·E is quite sensitive to the scale of instances in the images, where a too-small or too-large scale leads to inaccurate generations with unreliable geometries and appearances. To quantitatively evaluate the effect of instance scale, we further report the scene reconstruction performance under different instance scales in Table B of the rebuttal PDF. Through both quantitative and qualitative ablation studies, we observe that Shap·E performs best in shape generation from instance images at a scale of 6.

For all the experiments conducted in the paper, we set the scale to 6, following the conclusions of these ablation studies. We believe that a scale of 6 is a universally appropriate scale that fits most instances, as demonstrated by the comprehensive quantitative and qualitative evaluations on diverse datasets in Sec. 4 and Sec. B of the Appendix.

Q3: Quality of estimated depth.

We ensure the quality of the estimated depth by adopting the large model Omnidata, which is trained on large-scale datasets for depth estimation. In practice, the estimated depth is quite reliable thanks to the abundant knowledge the model has learned from large-scale training.

We also acknowledge that the depth point cloud may suffer from inevitable occlusions of unseen areas and from distortion, especially in difficult scenes. However, DeepPriorAssembly does not require very high-quality depths, since they are only used as a prior for recovering, via 2D/3D matching, the layout of the scene containing the reconstructed high-quality 3D shapes. The estimated depths are not directly used as the scene geometry; the quality of the reconstructed scenes is ensured by the quality of the reconstructed 3D shapes.

Q4: Applying DeepPriorAssembly to outdoor scenes.

We further conducted experiments to evaluate DeepPriorAssembly on complex outdoor scenes and scenes containing animals, as shown in Fig. A of the rebuttal PDF. The first image comes from the KITTI dataset; the others are collected from the Internet. With the help of powerful large foundation models, DeepPriorAssembly demonstrates superior zero-shot scene reconstruction performance in these real-world outdoor scenes.

Comment

Thanks for your explanation.

  1. As this paper is dedicated to assembling different priors from large foundation models, I think it would be more convincing to show the robustness of the pipeline by substituting different parts with similar foundation models.
  2. Moreover, I think it is inappropriate to claim superior performance in a zero-shot manner, because this capacity actually comes from the massive training data of Shap-E. Also, as we can observe from Figure A in the rebuttal PDF, the method fails to perform well on outdoor scenes, which is a limitation inherited from Shap-E.

Rationally, I would categorize this method as a straightforward engineering effort to assemble foundation models. I hope the authors can optimize the connection and compatibility between different priors.

Comment

Dear Reviewer veoH,

Thank you for your response and the helpful comments. We respond to each of your additional questions below. Please do not hesitate to let us know if you have any additional questions.

Discussion-Q1: Ablation studies on the choice of foundation models.

We fully recognize the importance of evaluating each sub-task by substituting the foundation models with similar alternatives. In fact, we have conducted ablation studies to explore the effectiveness of our chosen solution for each sub-task by comparing it with alternatives. The results and analysis are presented in Sec. C.1 and Table 4 of the Appendix. Specifically, we conducted ablations replacing Shap·E with One-2-3-45, Open-CLIP with EVA-CLIP, and Omnidata with MiDaS, and observed performance degradation with the alternative models. We further visually compare Shap·E with One-2-3-45 for shape generation in Fig. 8, where the results demonstrate that Shap·E is a more robust solution for generating 3D models from 2D instances. These ablation studies validate the effectiveness of each choice made within our framework.

Discussion-Q2: The claim of zero-shot scene reconstruction.

The term "zero-shot" signifies that no task-specific data or data-driven training is required for scene reconstruction from single images. Unlike previous works in this field, such as PanoRecon and Total3D, which require specific image-scene pair data, our approach does not rely on such data. The training data used for Shap·E only supports learning shape reconstruction, and it is impossible to train a model for single-view scene reconstruction solely by relying on the Shap·E data.

We argue that for "zero-shot" tasks, using large-scale data from a different domain or task is not prohibited. For example, CLIP models are widely used for zero-shot image classification without requiring specific image-class pair data, though they do require massive amounts of image-text pair data for contrastive learning. Similarly, the data used in Shap·E does not undermine our claim of "zero-shot" single-view scene reconstruction.
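As a concrete reminder of that analogy, zero-shot classification with a pretrained CLIP model needs only class names phrased as text prompts, with no image-class training pairs. A minimal sketch (the checkpoint, image path, label set, and prompt template below are illustrative assumptions):

```python
import torch
import open_clip
from PIL import Image

# Load a pretrained CLIP model; the checkpoint name is illustrative.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

classes = ["chair", "table", "sofa"]                      # hypothetical label set
image = preprocess(Image.open("instance.jpg")).unsqueeze(0)
texts = tokenizer([f"a photo of a {c}" for c in classes])

with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(texts)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_f @ txt_f.T).softmax(dim=-1)     # zero-shot class probabilities
```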

We will provide further clarification on the tasks and claims to enhance the understanding of DeepPriorAssembly's capabilities and the data used in each foundational model integrated into our framework.

Discussion-Q3: Performance on outdoor scenes.

We would like to clarify that none of the previous works can successfully reconstruct outdoor scenes from single images. All prior approaches are trained on indoor scenes and struggle to generalize to real-world outdoor images, which contain out-of-distribution objects such as trees, buildings, and animals. Leveraging powerful large foundation models, DeepPriorAssembly is the first to demonstrate the capability for zero-shot scene reconstruction in these complex real-world outdoor scenes. As illustrated in Fig. A of the rebuttal PDF, DeepPriorAssembly accurately reconstructs scene geometries even in challenging outdoor scenes. However, the texture produced by Shap·E may not be optimal for outdoor shapes. In the future, we may consider replacing Shap·E with a shape reconstruction method that performs better on outdoor shapes to further enhance outdoor scene reconstruction.

Discussion-Q4: Optimize the connection and compatibility between different priors.

Thanks for your suggestions. In the revision, we will improve the connection and compatibility among the foundation models by conducting more ablation studies on the choice of each foundation model within our framework, supplementing the analysis presented in Sec.C.1 and Table 4.

We are deeply grateful for your invaluable feedback and the time you dedicated to evaluating our work. Your comments and expertise are sincerely appreciated. Please let us know if there is anything we can clarify further.

Best regards,

Authors

Comment

Thanks for your response. I appreciate your effort on ablation studies for different foundation models.

I think the "zero-shot" ability of your method comes mainly from the off-the-shelf foundation models and from the divide-and-conquer strategy, so it should not be counted as an inherent advantage over other data-driven methods. In my opinion, the authors should design extensive experiments to demonstrate the advantages of the way the models are combined, rather than of the "combination" itself.

Overall, I think the technical contribution of this assembly is still limited, so I will keep my score as 5.

Comment

Dear Reviewer veoH,

Thank you for your response and the positive assessment. We really appreciate your expertise and all the invaluable feedback. We respond to each of your additional questions below.

Discussion-Q5: Efforts on how the deep priors are combined.

We have invested extensive effort in exploring the optimal approach to robustly integrating the available deep priors for this challenging task, rather than merely combining them.

We demonstrate that the naive approach of simply combining several large models fails to solve the challenging task of zero-shot scene reconstruction. Beyond the contribution of being the first to propose assembling large models and decomposing this challenging task, we offer additional significant technical contributions that address the critical challenge of making deep priors work together robustly. Specifically, the naive solution involves segmenting the input scene images and then generating 3D objects from the segmented instances. However, this solution fails dramatically because: (1) the instances are often corrupted by occlusions and low resolution, leading to failures in reconstructing complete and high-quality 3D objects, and (2) none of the existing techniques are capable of recovering the scene layout for the 3D objects.

To address these challenges and robustly assemble deep priors for zero-shot 3D scene reconstruction, we propose two significant technical contributions: improving the quality of 3D instances and accurately recovering the scene layouts.

(1) To improve the robustness of the framework and overcome the challenges of occlusion and low resolution in segmented instances, we introduce the StableDiffusion model to enhance and inpaint the instance images, followed by CLIP models to filter out poor-quality samples and select the ones that best match the instance. These designs on task decomposition and on introducing suitable deep priors are the key contributors to achieving accurate and high-quality geometries and appearances of the generated shapes.
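To make this enhance-and-filter step concrete, a minimal sketch using publicly available implementations (diffusers for Stable Diffusion inpainting, open_clip for scoring) could look as follows; the checkpoint names, prompt handling, and sample count are illustrative assumptions rather than the exact configuration used in our framework:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline
import open_clip

device = "cuda"

# Stable Diffusion inpainting completes occluded / low-resolution instance crops.
sd = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16).to(device)

# CLIP scores each inpainted sample against the original crop to filter poor samples.
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
clip_model = clip_model.to(device).eval()

def enhance_instance(crop: Image.Image, mask: Image.Image, prompt: str, m: int = 8) -> Image.Image:
    """Inpaint the occluded region m times, then keep the sample most similar to the crop."""
    samples = sd(prompt=prompt, image=crop, mask_image=mask,
                 num_images_per_prompt=m).images
    with torch.no_grad():
        ref = clip_model.encode_image(preprocess(crop).unsqueeze(0).to(device))
        ref = ref / ref.norm(dim=-1, keepdim=True)
        feats = torch.cat([clip_model.encode_image(preprocess(s).unsqueeze(0).to(device))
                           for s in samples])
        feats = feats / feats.norm(dim=-1, keepdim=True)
        scores = (feats @ ref.T).squeeze(-1)        # cosine similarity to the original instance
    return samples[int(scores.argmax())]
```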

(2) For the challenging task of recovering the layout of the scene containing the reconstructed shapes, we propose a novel approach that optimizes the location, orientation, and size of instances by matching them with both 2D and 3D supervision. The supervision comes from the estimated segmentation masks and the predicted depths, which are also obtained by the assembled deep priors. Moreover, a RANSAC-like solution is proposed to further improve the robustness of the pose/scale optimization. This approach is the link among the deep priors and plays the key role in robustly assembling them for the final target of zero-shot scene reconstruction.
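The following is a minimal, illustrative sketch of such a RANSAC-like pose/scale optimization against the back-projected depth point cloud; the 2D mask-matching term is omitted for brevity, and the one-directional Chamfer loss, hypothesis count, and iteration numbers are assumptions for illustration, not the paper's exact formulation:

```python
import torch

def chamfer_one_way(src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
    """Mean distance from each source point to its nearest target point; src: (N, 3), tgt: (M, 3)."""
    return torch.cdist(src, tgt).min(dim=1).values.mean()

def fit_pose_scale(shape_pts, depth_pts, n_hypotheses=16, iters=300, subset=2048):
    """RANSAC-like fitting: sample several depth subsets / initializations, optimize
    translation, yaw and scale for each hypothesis, and keep the one with the lowest
    residual on the full depth point cloud."""
    best_params, best_residual = None, float("inf")
    for _ in range(n_hypotheses):
        idx = torch.randperm(depth_pts.shape[0])[:subset]     # random subset of the depth points
        tgt = depth_pts[idx]

        t = tgt.mean(dim=0).clone().requires_grad_(True)      # translation, init at subset centroid
        yaw = (torch.rand(1) * 6.28).requires_grad_(True)     # rotation about the (assumed) up axis
        log_s = torch.zeros(1, requires_grad=True)            # log-scale
        opt = torch.optim.Adam([t, yaw, log_s], lr=1e-2)

        def transform():
            c, s = torch.cos(yaw), torch.sin(yaw)
            R = torch.stack([torch.cat([c, -s, torch.zeros(1)]),
                             torch.cat([s, c, torch.zeros(1)]),
                             torch.tensor([0.0, 0.0, 1.0])])
            return log_s.exp() * shape_pts @ R.T + t          # scaled, rotated, translated proposal

        for _ in range(iters):
            loss = chamfer_one_way(transform(), tgt)          # 3D matching term
            opt.zero_grad(); loss.backward(); opt.step()

        with torch.no_grad():                                  # score the hypothesis on all points
            residual = chamfer_one_way(transform(), depth_pts).item()
        if residual < best_residual:
            best_params = (yaw.detach().clone(), t.detach().clone(), log_s.exp().detach().clone())
            best_residual = residual
    return best_params
```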

Following your suggestions, we will conduct more experiments on how the deep priors are combined in our revision.

Discussion-Q6: Technical contributions.

Our technical contributions lie in the novel task decomposition, the choice and assembly of deep priors, and the proposed RANSAC-like pose/scale optimization for recovering scene layouts. We summarize the main contributions as follows.

  1. We propose the first framework that assembles diverse deep priors from large models for the extremely difficult task of reconstructing scenes from single images in a zero-shot manner.
  2. To improve the robustness of the framework and overcome the challenges in this task (e.g., occlusion and low resolution of instances), we utilize the StableDiffusion model for image enhancement and inpainting, combined with the CLIP model to filter out poor-quality samples.
  3. We introduce a novel approach that optimizes the location, orientation, and size of instances by matching them with both 2D and 3D supervision. Moreover, a RANSAC-like solution is proposed to further improve the robustness of the pose/scale optimization. This approach links the deep priors and plays the key role in robustly assembling them for zero-shot scene reconstruction.

Only with our designs for task decomposition and deep prior selection, together with our RANSAC-like pose/scale optimization through both 2D and 3D matching to recover the scene layout, can the assembly of deep priors from large models succeed in the extremely challenging task of zero-shot scene reconstruction.

We are deeply grateful for your invaluable feedback and the time you dedicated to evaluating our work. Your comments and expertise are sincerely appreciated. Please let us know if there is anything we can clarify further.

Best regards,

Authors

Review (Rating: 5)

This work introduces a system named "Deep Prior Assembly" for zero-shot scene reconstruction from a single image. It breaks down the single-image scene reconstruction task into several steps that can be solved utilizing pretrained large models, such as SAM for segmentation, Shap-E for 3D object generation, and Omnidata for depth estimation. Additionally, an optimization-based approach is also proposed for pose estimation.

Strengths

1. It makes sense to break down a difficult task into simpler ones, and the outcomes appear promising.

This study puts into practice the concept of breaking down the challenging single-view reconstruction task into several simpler tasks that can be addressed using off-the-shelf pretrained models.

2. Experiments and ablation studies are thorough.

Extensive quantitative and qualitative evaluations are provided in the paper, demonstrating the effectiveness of the proposed pipeline.

Weaknesses

1. Technical contribution of this work is limited.

The work seems to be more of an engineering trial that combines pretrained models to build a reconstruction system, rather than an insightful research endeavor. The authors highlight that they are exploring "deep priors" from large pretrained models, but the priors are not well integrated and instead function independently as separate modules. Additionally, naively combining many large models could lead to high computational demand and potential error accumulation. For example, the proposed system runs a diffusion generative model several times for each object in a scene and includes a 9.2-second optimization for the pose estimation of each object.

2. Missing state-of-the-art baseline methods.

Although thorough evaluation results are provided in the paper, all the baseline methods were proposed before 2021. Has there been any new progress in single-view scene reconstruction after 2021? For example, ScenePrior [1], published in CVPR'23, introduces a conditional autoregressive generative method for single-view reconstruction.


[1] Learning 3D Scene Priors with 2D Supervision. Nie et al. CVPR 2023.

Questions

1. How much time does each stage of the proposed method take? And how much time do the baseline methods take?

2. What is the relation between the total running time of reconstruction and the number of objects in the scene? And how about for the baseline methods?

3. Have you considered using methods like 3D object detection for layout estimation? It could potentially offer faster and more accurate results than the proposed depth-based method.

Limitations

1. A naive combination of pretrained large models might cause error accumulation.

2. The proposed method has a much longer running time compared to baseline methods.

Author Response

We sincerely appreciate Reviewer DEJS's acknowledgment of our work and constructive feedback. We respond to each question below.

Q1: Technical contribution.

We are the first to explore cooperation among large foundation models for another extremely difficult task that none of them can accomplish alone. The key motivation of our method stems from the recent success of large foundation models, which have led a revolution in language/vision computing. These large models show brilliant capabilities and remarkable performance, but each is limited to a specific task with a specific modality. Driven by this observation, we propose to explore an effective solution that leverages existing expert large models, designed and trained for specific tasks, to address the extremely challenging task of 3D scene reconstruction from single images. We aim at a zero-shot framework in which no part necessitates extra data collection, preparation, or time-consuming data-driven training.

To this end, we propose DeepPriorAssembly, a novel framework that assembles diverse deep priors from large models for scene reconstruction from single images in a zero-shot manner. We rethink this task from a new perspective and decompose it into a set of sub-tasks instead of seeking a data-driven solution. We narrow down the responsibility of each deep prior to a sub-task that it is good at, and introduce novel methods for poses, scales, and occlusion parsing to enable the deep priors to work together robustly. We believe DeepPriorAssembly introduces a new direction for the community to flexibly exploit the potential of existing powerful large models.

Q2: Running time of DeepPriorAssembly.

We have reported the running time of each stage of DeepPriorAssembly in Sec. F and Table 5 of the Appendix. Reconstructing a scene from a single image takes 171.2 seconds in total. The inference of Grounded-SAM, Open-CLIP, and Omnidata takes only about 1 second. The most time-consuming parts are StableDiffusion, Shap·E, and the RANSAC-like optimization. For a balance between efficiency and quality, we can optionally reduce the sample number M of StableDiffusion and the iteration count r of the RANSAC-like solution, which reduces the total time to less than 60 seconds. For the baseline method PanoRecon, the inference time is 30.6 seconds. We further note that we do not require any extra time for data collection, data preparation, or data-driven training. In contrast, PanoRecon requires 5 days of data-driven training, as reported in its paper, not including the time for data preparation.

Q3: Potential error accumulation.

We demonstrate that the effective integration of additional large models does not compromise the robustness of our method. On the contrary, this design enhances both its robustness and accuracy. For instance, we incorporate StableDiffusion to enhance and inpaint images, improving CDL1 from 0.125 to 0.110, as shown in Table 2 of the ablation study. The CLIP model is introduced to filter out poor samples, leading to more robust results, as shown in Table 3. The other proposed strategies and constraints (e.g., the RANSAC-like solution and 2D/3D matching) are also designed to improve robustness. Please refer to Sec. 4.4 for comprehensive ablation studies on the effectiveness of integrating each module.

Q4: More comparisons with recent methods.

We have additionally compared our method with other SOTA data-driven scene reconstruction works, BUOL [CVPR 2023] and Uni-3D [ICCV 2023], in Sec. D and Fig. 12 of the Appendix. The results demonstrate that our method achieves better and more visually appealing results on both the 3D-Front and ScanNet datasets. Specifically, our method significantly outperforms the other methods on real-world images from ScanNet.

Following your suggestions, we further compare DeepPriorAssembly with ScenePrior [CVPR 2023] on the ScanNet dataset, as shown in Fig. B of the rebuttal PDF. The reconstruction results of ScenePrior are provided by its authors. As shown, DeepPriorAssembly clearly outperforms ScenePrior in terms of the quality of scene geometries. Moreover, ScenePrior can only reconstruct geometry, whereas DeepPriorAssembly is capable of recovering high-fidelity scene appearances as well.

Q5: Relation between running time and the number of objects.

Scenes containing more objects generally lead to longer running times. However, as we analyze in Sec. F of the Appendix, the inference of Grounded-SAM, Open-CLIP, and Omnidata takes only about 1 second. The most time-consuming parts are StableDiffusion, Shap·E, and the RANSAC-like optimization. For these three components, we can process all instances of the scene in parallel (e.g., with multiple GPUs), which can significantly reduce the required running time. Specifically, by processing each instance of the scene on a separate GPU in parallel, the running time for a scene containing multiple instances may not be much longer than that for a scene containing only one instance.
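As a rough illustration of this per-instance parallelism (the `run_instance_pipeline` wrapper and the round-robin GPU assignment below are hypothetical, not part of any released code):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def reconstruct_instance(args):
    """Run the heavy per-instance stages (inpainting, shape generation, pose/scale
    optimization) for one segmented instance on its assigned GPU."""
    instance, gpu_id = args
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)     # bind this worker to one GPU
    from my_pipeline import run_instance_pipeline        # hypothetical per-instance wrapper
    return run_instance_pipeline(instance)

def reconstruct_scene(instances, n_gpus=4):
    # Distribute instances over the available GPUs; each worker process sees one GPU,
    # so instances are processed concurrently instead of sequentially.
    jobs = [(inst, i % n_gpus) for i, inst in enumerate(instances)]
    with ProcessPoolExecutor(max_workers=n_gpus) as pool:
        return list(pool.map(reconstruct_instance, jobs))
```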

Q6: Leveraging 3D object detection for layout estimation.

The input to DeepPriorAssembly is only a single scene image, so no 3D data is available for 3D object detection. An alternative approach is to leverage the back-projected depth point clouds for 3D object detection. However, the depth point cloud is often of low quality and corrupted by occlusions of unseen areas. Therefore, it is extremely difficult for existing 3D object detection methods to accurately detect 3D objects from the corrupted depth point cloud. We also emphasize that DeepPriorAssembly does not require very high-quality depths, since they are only used for recovering the layout of the scene containing the reconstructed high-quality shapes. The estimated depths are not directly used as the scene geometry; the quality of the reconstructed scenes is ensured by the quality of the reconstructed 3D shapes.
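For reference, back-projecting an estimated depth map into such a point cloud is straightforward given (assumed) pinhole intrinsics; the quality issues discussed above come from the depth values themselves rather than from this step. A minimal sketch:

```python
import numpy as np

def backproject_depth(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Convert an HxW depth map into an (N, 3) point cloud in camera coordinates,
    assuming a simple pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]          # drop invalid (zero-depth) pixels
```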

Comment

I appreciate the detailed response to the review and have also read the comments from the other reviewers. While the proposed method presents impressive experimental results, it is more of an engineering effort that combines several pretrained modules into a software pipeline, which is not quite suitable for the NeurIPS research community. Therefore, I maintain my original score of "5: Borderline accept" and would like to participate actively in the subsequent reviewer discussion period.

Comment

Dear Reviewer DEJS,

Thank you for your response and the positive assessment of our rebuttal. We really appreciate your expertise and all the invaluable feedback. The technical contributions of DeepPriorAssembly lie in the exploration of a robust and effective approach to integrating the available deep priors for a challenging task, rather than in simply combining them.

We show that simply combining several large models is not enough to solve the difficult task of zero-shot scene reconstruction. While our work is the first to propose assembling large models and decomposing the task to tackle this challenge, we also make key technical contributions to ensure the deep priors work together effectively and robustly.

The naive approach involves segmenting scene images and generating 3D objects from these segments, but it fails because (1) occlusions and low resolution often lead to incomplete or poor-quality 3D objects, and (2) existing techniques cannot recover the 3D scene layout. To overcome these issues, we propose two main techniques:

  1. To enhance the quality of 3D instances and handle occlusions and low resolution, we introduce the StableDiffusion model to refine and inpaint segmented images, followed by CLIP models to filter and select the best-matching instances. These innovations in task decomposition and deep prior integration are crucial for producing accurate and high-quality 3D shapes.
  2. For recovering the scene layout, we introduce a novel method to optimize the location, orientation, and size of instances by matching them with both 2D and 3D supervision. These supervisions come from estimated segmentation masks and predicted depths, generated by the assembled deep priors. Additionally, a RANSAC-like solution further improves the robustness of pose and scale optimization. This method effectively links the deep priors and is key to achieving robust zero-shot scene reconstruction.

Thank you for your valuable feedback and the time you spent evaluating our work. We truly appreciate your insights.

Best regards,

Authors

Review (Rating: 5)

The method leverages multiple off-the-shelf models to parse a scene, represented by a single image, into 3D assets with their respective layout. Concretely, it uses a segmentation model to locate objects, then a diffusion model to enhance image quality, uses Shap-E to generate 3D proposals, and estimates the layout using depth estimation and point-cloud matching. The method outperforms existing methods.

Strengths

Originality:

I think the originality is relatively low as all of the components in the system are off-the-shelf models and they are combined in a relatively straightforward manner.

Quality:

The results look good both quantitatively and qualitatively. I am less familiar with the task this work is trying to solve, so I cannot speak to how significant the quantitative improvements are or the strength of the baselines.

Clarity:

Overall the figures illustrate the method well and the qualitative examples demonstrate the results.

Significance:

I cannot speak to the significance of the method or results as I am not that familiar with this task or the relevant literature. I think it would help to spend more time in the introduction discussing why this task is important and what the applications are. VR/AR is mentioned in passing, but more concretely anchoring the task to a specific application would help motivate the work.

Weaknesses

The method relies on the quality of foundational models to perform well. The comparison to other baselines is not apples to apples since the other methods are not trained on the large datasets leveraged by the foundational model. This isn't necessarily a flaw in this work, since leveraging additional data could be seen as a potential advantage, but it does create an unequal comparison.

Like I mentioned in the strengths section, I think it's unclear what the motivation for the paper is. A little further discussion to anchor this work to an application would be helpful since the method novelty is not particularly high.

Questions

Does this work for other types of scenes, such as outdoor scenes? Or is it restricted to indoor scenes with furniture?

Suggestions:

Figure 2 has a lot of elements and is hard to parse. Maybe simplifying the figure to illustrate the main idea would help.

Limitations

Yes limitations are discussed.

Author Response

We sincerely appreciate Reviewer 67ke's invaluable feedback and the time invested in evaluating our work. We respond to each question below.

Q1: The applications of single-view scene reconstruction.

The task of single-view scene reconstruction greatly contributes to the domain of AIGC, AR/VR, robotics, games, 3D design, etc.

  1. AI content generation (AIGC) for 3D is a popular recent topic, generating diverse 3D models from user prompts (e.g., images). Most previous works focus on generating 3D objects, which is much easier than generating complex scenes. DeepPriorAssembly flexibly generates complete scenes from single images, advancing the development of this field.
  2. Recovering scenes from single views plays a crucial role in the field of augmented/virtual reality, enabling spatial interaction between humans and environments.
  3. This task also contributes to the field of robotics. DeepPriorAssembly can recover the environment around a robot using its camera, facilitating comprehensive scene understanding for semantic recognition, collision detection, human-robot interaction, and more.
  4. Single-view scene reconstruction can greatly improve the efficiency of game production. Given a game scene image drawn by a game illustrator, DeepPriorAssembly can directly reconstruct the 3D game scene for interaction, eliminating the need for a 3D modeler to manually recreate each 3D object from the scene image and fit it to the correct position, rotation, and scale.

We will add more discussion and analysis on the applications of DeepPriorAssembly in our revision.

Q2: The motivation and originality of DeepPriorAssembly.

The key motivation of our method stems from the recent success of large foundation models (e.g., ChatGPT, VLMs, StableDiffusion, CLIP), which have led a revolution in language/vision computing. By greatly scaling up the number of training samples and model parameters, large models show brilliant capabilities and remarkable performance. However, each is confined to a specific task with a specific modality, which limits its capability in high-level perception tasks.

Driven by this observation, we propose to explore an effective and robust solution that leverages existing expert large models, designed and trained for specific tasks, to address the extremely challenging task of 3D scene reconstruction from single images. Note that most large models are publicly available, so everyone can access these deep priors without additional effort. We are committed to providing new insights for the community on assembling existing powerful large models from different domains and tasks to tackle another, more challenging task without extra knowledge. That is, we aim at a zero-shot framework in which no part necessitates extra data collection, preparation, or time-consuming data-driven training.

To this end, we propose DeepPriorAssembly, a novel framework that assembles diverse deep priors from large models for scene reconstruction from single images in a zero-shot manner. We rethink this task from a new perspective and decompose it into a set of sub-tasks instead of seeking a data-driven solution. We narrow down the responsibility of each deep prior to a sub-task that it is good at, and introduce novel methods for poses, scales, and occlusion parsing to enable the deep priors to work together robustly. We are the first to explore cooperation among large foundation models for another extremely difficult task that none of them can accomplish alone. We believe DeepPriorAssembly introduces a new direction for flexibly exploiting the potential of existing powerful large models.

Q3: Fairness of the evaluation against other methods.

All previous methods for single-view scene reconstruction require additional task-specific data collection and time-consuming training. In contrast, our method merely assembles existing available large models without requiring any extra knowledge. We believe that other methods require stricter data settings than DeepPriorAssembly, but are also limited to the known data distribution. For example, PanoRecon requires a large collection of image-scene pairs from the 3D-Front dataset and task-specific training; it performs well on the test set of 3D-Front but fails to generalize to out-of-distribution images in the real world. We believe the evaluation is fair to other methods, since none of them can handle this task under our experimental conditions, i.e., without data collection and even without data-driven training.

Q4: Applying DeepPriorAssembly to outdoor scenes.

We further conducted experiments to evaluate DeepPriorAssembly on complex outdoor scenes and scenes containing animals, as shown in Fig. A of the rebuttal PDF. The first image comes from the KITTI dataset; the others are collected from the Internet. With the help of powerful large foundation models, DeepPriorAssembly demonstrates superior zero-shot scene reconstruction performance in these real-world outdoor scenes.

Q5: Complexity of Figure 2.

We will simplify Figure 2 to provide a clearer illustration of the main idea by moving some framework details to separate figures.

Comment

I thank the authors for their response.

After reading the other reviews and the responses to the reviewers, I will maintain my rating of borderline, 5. I think the technical contribution is limited, as foundation models are being fused together and the techniques for combining them don't seem particularly general. I think the motivation for matching assets to images of scenes is rather limited in its current form. In robotics, meshes with 6 DoF are estimated from images for grasping, but this task is a bit far from those in robotics.

On the positive side, the paper improves over current methods, the figures and qualitative results are well done, and the writing and organization are clear. Therefore I maintain my rating of 5.

Comment

Dear Reviewer 67ke,

Many thanks for the positive assessment. We really appreciate your expertise and all the invaluable feedback. Our method focuses on how to robustly combine the available deep priors for a challenging task. We clarify the technical contributions below.

We emphasize that the naive approach of simply introducing several large models fails to solve the challenging task of zero-shot scene reconstruction. Beyond the contribution of being the first to propose assembling large models and decomposing this challenging task, we offer additional significant technical contributions that address the critical challenge of making deep priors work together robustly. Specifically, the naive solution involves segmenting the input scene images and then generating 3D objects from the segmented instances. However, this solution fails dramatically because: (1) the instances are often corrupted by occlusions and low resolution, leading to failures in reconstructing complete and high-quality 3D objects, and (2) none of the existing techniques are capable of recovering the scene layout for the 3D objects.

To address these challenges and robustly assemble deep priors for zero-shot 3D scene reconstruction, we propose two significant technical contributions: improving the quality of 3D instances and accurately recovering the scene layouts.

(1) To improve the robustness of the framework and overcome the challenges of occlusion and low resolution in segmented instances, we introduce the StableDiffusion model to enhance and inpaint the instance images, followed by CLIP models to filter out poor samples and select the ones that best match the instance. These designs on task decomposition and on introducing suitable deep priors are the key contributors to achieving accurate and high-quality geometries and appearances of the generated shapes.

(2) For the challenging task of recovering the layout of the scene containing the reconstructed shapes, we propose a novel approach that optimizes the location, orientation, and size of instances by matching them with both 2D and 3D supervision. The supervision comes from the estimated segmentation masks and the predicted depths, which are also obtained by the assembled deep priors. Moreover, a RANSAC-like solution is proposed to further improve the robustness of the pose/scale optimization. This approach is the link among the deep priors and plays the key role in robustly assembling them for the final target of zero-shot scene reconstruction.

Our technical contributions can be summarized as follows.

  1. We propose the first framework that assembles diverse deep priors from large models for the extremely difficult task of reconstructing scenes from single images in a zero-shot manner.
  2. To improve the robustness of the framework and overcome the challenges in this task (e.g., occlusion and low resolution of instances), we utilize the StableDiffusion model for image enhancement and inpainting, combined with the CLIP model to filter out poor-quality samples.
  3. We introduce a novel approach that optimizes the location, orientation, and size of instances by matching them with both 2D and 3D supervision. Moreover, a RANSAC-like solution is proposed to further improve the robustness of the pose/scale optimization. This approach links the deep priors and plays the key role in robustly assembling them for zero-shot scene reconstruction.

Only with our designs for task decomposition and deep prior selection, together with our RANSAC-like pose/scale optimization through both 2D and 3D matching to recover the scene layout, can the assembly of deep priors from large models succeed in the extremely challenging task of zero-shot scene reconstruction.

We are deeply grateful for your invaluable feedback and the time you dedicated to evaluating our work. Your comments and expertise are sincerely appreciated. Please let us know if there is anything we can clarify further.

Best regards,

Authors

Review (Rating: 5)

A multi-stage pipeline for single-image 3D reconstruction is proposed, leveraging multiple off-the-shelf models. To begin, SAM is used to segment and decompose the input image. Stable Diffusion is then leveraged to complete instance segments with potentially missing information, and failures of this process are filtered out by CLIP. Finally, Shap-E is applied to generate 3D models, which are then registered to the image for a final 3D reconstruction.

Strengths

The proposed technique achieves strong zero-shot performance relative to baseline methods despite not training on similar data.

Weaknesses

The method seems overly complicated and unlikely to be robust.

  • The argument about the heuristic selection of the depth shift seems unconvincing; in practice, the correct depth shift from a scale- and shift-invariant monodepth estimator can vary widely between multiple images, even in the same scene. Why not use multiple pairs of images to compute the appropriate depth shift, or metrically ground the depth shift as in RealmDreamer?
  • In order to go from a SAM instance to a 3D object, a very complex pipeline is proposed. Essentially, it consists of inpainting with SD, followed by filtering with CLIP, followed by shape estimation with Shap-E, followed by a likely unstable alignment procedure. This seems error-prone. Why not just train an LRM-like model that accepts potentially off-center and partially occluded instance images and produces 3D objects aligned to the input camera's location? This seems not too difficult to train, there are 10M+ object datasets publicly available to achieve it, and it would certainly be more robust than the proposed pipeline.

Questions

Since the method consists of chaining several foundation models together, it should be possible to show "zero-shot" performance on scenes not limited to the simple synthetic or indoor room scenes shown here. Can the proposed method work on more complex scenes, such as real images, outdoor scenes or scenes containing animals or people?

Limitations

Yes, there is a thorough limitations discussion.

Author Response

We deeply appreciate Reviewer osDy's thoughtful feedback and the time invested in evaluating our work. We respond to each question below.

Q1: The robustness of DeepPriorAssembly.

We demonstrate that the effective integration of additional large models does not compromise the robustness of our method. On the contrary, this design enhances both its robustness and accuracy. For instance, we incorporate StableDiffusion to enhance and inpaint images, improving CDL1 from 0.125 to 0.110, as shown in Table 2 of the ablation study. Additionally, we introduce the CLIP model to filter out poor samples, leading to more robust scene reconstruction results (0.118 to 0.110), as demonstrated in Table 3 of the ablation study. The other proposed strategies and constraints (e.g., the RANSAC-like solution, 2D matching, and 3D matching) are also designed to improve the robustness of DeepPriorAssembly. Please refer to Tables 2 and 3 for comprehensive ablation studies on the effectiveness of each module in improving the robustness of our method.

Q2: The heuristic selection of depth shift and the RealmDreamer solution.

We greatly appreciate the insightful advice from Reviewer osDy on depth shift estimation. We fully agree that using multiple image-depth pairs to compute the appropriate depth shift would lead to a more accurate and robust depth scale and shift. We used only one image-depth pair to reduce the requirement for ground-truth depths, and found the results convincing. Additionally, we provide ablation studies on the number of image-depth pairs in Table A of the rebuttal PDF.
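For context, fitting the scale and shift of a scale/shift-invariant depth prediction to reference depth values reduces to a small least-squares problem. The sketch below shows this common closed-form alignment (variable names are hypothetical, and this is not necessarily the exact procedure used in the paper):

```python
import numpy as np

def align_scale_shift(pred: np.ndarray, ref: np.ndarray, mask: np.ndarray):
    """Solve min_{s,t} || s * pred + t - ref ||^2 over valid pixels.
    pred: relative (scale/shift-invariant) depth, ref: reference depth, mask: valid pixels."""
    p, r = pred[mask].ravel(), ref[mask].ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)       # design matrix [pred, 1]
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)   # least-squares scale s and shift t
    return s, t

# With several image-depth pairs, the valid pixels of all pairs can be stacked before
# solving, making the estimated (s, t) less sensitive to errors in any single pair.
```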

RealmDreamer metrically grounds the depth scale and shift by aligning the relative depths with the metric depths predicted by DepthAnything. We agree that this could potentially be a better approach for directly solving for the depth shift. However, RealmDreamer was released on arXiv in April 2024, only one month before the NeurIPS deadline, by which time we had already completed the core development of DeepPriorAssembly. To evaluate the effectiveness of a RealmDreamer-like solution in our framework, we replaced our depth solution with it and conducted experiments on complex outdoor scenes, as shown in Fig. A of the rebuttal PDF. The results indicate that the RealmDreamer-like solution integrates well into our pipeline. Nevertheless, our depth solution also performs well across different datasets, as demonstrated by the comprehensive experiments in Sec. 4.

Q3: The effectiveness of the proposed pipeline for shape reconstruction from SAM instances.

As far as we know, none of the previous approaches can reconstruct complete and high-fidelity shapes from occluded 2D instances. We propose the first solution for this difficult task from a new perspective by assembling deep priors from large models, without requiring task-specific data preparation, model design, or training. Comprehensive evaluations and ablation studies demonstrate the effectiveness of the proposed pipeline for recovering complete and high-fidelity shape reconstructions from occluded and low-resolution SAM instances.

Q4: Why not train an LRM-like model for shape reconstruction and pose estimation.

The key motivation of our method is to explore an effective and robust solution that leverages existing expert large models, designed and trained for specific tasks, to address the extremely challenging task of 3D scene reconstruction from single images. We are committed to providing new insights for the community on assembling existing powerful large models from different domains and tasks to tackle another, more challenging task without extra knowledge. That is, we aim at a zero-shot framework in which no part necessitates extra data collection, preparation, or time-consuming data-driven training. Training an LRM-like model to handle shape reconstruction from SAM instances is not our way, as it requires extensive task-specific efforts in model design, data preparation, and training.

Moreover, we note that none of the previous approaches can reconstruct complete and high-fidelity shapes from the occluded 2D instances produced by SAM. The difficulties that prevent LRM-like techniques from successfully solving this task are as follows:

  1. The manner, location, and ratio of occlusions in SAM instances are unpredictable and vary significantly from scene to scene and from instance to instance, making it extremely difficult to collect corrupted-instance/complete-shape pairs and to stably train an LRM-like model.
  2. The 2D instances often suffer from low resolution. For small instances or instances far from the camera, the resolution can be low, making it difficult for an LRM-like model to accurately capture the semantics in the instance image and reconstruct shapes with details.
  3. Real-world images contain diverse categories of instances, requiring an extremely large annotated dataset for this task. The scale of available 3D data (about 10M objects) is still much smaller than that of publicly available 2D/language data.

Our proposed pipeline effectively utilizes large models pretrained on billions of 2D/language data samples, demonstrating superior performance in reconstructing high-fidelity shapes given only corrupted SAM instances with occlusions and low resolution as inputs.

Q5: Applying DeepPriorAssembly to real images, outdoor scenes, or scenes containing animals or people.

We note that ScanNet is a real-captured dataset. We have shown scene reconstruction results on real images from the ScanNet dataset in Fig. 14 of the Appendix. We additionally conducted experiments to evaluate DeepPriorAssembly on complex outdoor scenes and scenes containing animals, as shown in Fig. A of the rebuttal PDF. The first image comes from the KITTI dataset; the others are collected from the Internet. With the help of powerful large foundation models, DeepPriorAssembly demonstrates superior zero-shot scene reconstruction performance in these real-world outdoor scenes.

Comment

Dear reviewer,

Please read the author rebuttal and the other reviews and post a comment as to how your opinion has or has not changed and why.

Comment

Thanks for the thoughtful rebuttal. I'll change my score to borderline accept.

I'm still unconvinced about the pipeline setup. You might consider https://zixuanh.com/projects/zeroshape.html (in particular how they handle their training data) as an alternative and competing approach which would be likely to have strong performance if trained in the specified way.

The results on in-the-wild images are impressive! I would definitely suggest featuring these (and more, plus other examples) more prominently in a revised version of the manuscript, since the existing datasets are rather monotonous and supervised methods would probably perform well on them, so the "zero-shot" aspect is not really emphasized.

Comment

Dear Reviewer osDy,

Many thanks for all the helpful comments and the positive assessment. We will include the new experimental results and more reconstruction samples in the revised version of the manuscript. Following your suggestions, we will consider training a reconstruction model similar to ZeroShape for predicting complete shapes as future work for DeepPriorAssembly. We will add these insights to Sec. 5 of the revised paper and conduct experiments to explore the effectiveness of this alternative.

We really appreciate your raising the score.

Best regards,

Authors

Author Response

We are grateful to the reviewers for their invaluable feedback and the time they dedicated to evaluating our work. We are delighted that the reviewers appreciated the presentation and the significance of the paper. We respond to each reviewer separately with detailed analysis, visualizations, and ablation studies to address all the raised questions. We have uploaded a rebuttal PDF with additional experimental results and visualizations. In the following rebuttals, we use "rebuttal PDF" to refer to the provided PDF, as in "in Table A of the rebuttal PDF".

Thank you again for your insightful feedback and we are looking forward to continuing the discussion.

Comment

Dear reviewers,

We appreciate your comments and expertise. Please let us know if there is anything we can clarify further. We would be happy to take this opportunity to discuss with you.

Thanks,

The authors

Comment

Dear reviewers,

As the reviewer-author discussion period is about to end, we are looking forward to your feedback on our rebuttal. Please let us know if our responses address your concerns. We would be glad to make any further explanation and clarification.

Thanks,

The authors

Final Decision

All reviewers generally felt that reasons for acceptance outweigh those for rejection, although they agree that this feels like an engineering effort. Nevertheless, the AC recommends acceptance.