Consistent123: One Image to Highly Consistent 3D Asset Using Case-Aware Diffusion Priors
Image-to-3D using case-aware diffusion priors.
Abstract
Reviews and Discussion
The manuscript proposes a pipeline for generating 3D content from a single image. Consistent123 builds upon the 3D prior loss proposed in Zero-1-to-3 [Liu et al., 2023], the 2D prior loss proposed in DreamFusion [Poole et al., 2023], and the reference-image reconstruction loss setup of Magic123 [Qian et al., 2023]. The key insight of the approach is that the 3D prior is helpful for generating consistent structures while the 2D prior can improve details. Thus, an annealing strategy and a CLIP-based switching mechanism are designed to blend the two phases together. Consistent123 is validated on the publicly available RealFusion15 dataset and a customized C10 dataset.
Strengths
- Consistent123 is benchmarked against many recent SOTA baselines for single-image 3D shape generation.
- The methodology is logically sound based on empirical observations.
Weaknesses
- Incomplete ablation experiments
  - The effectiveness of CLIP-guided termination is never assessed. A simple baseline would be to use a fixed number of training iterations for the 3D phase and then switch to the dynamic phase. Moreover, monitoring the change of the CLIP rate can get stuck at local minima and may exhibit large variations across object categories. The heuristics for the threshold and moving-average length are also not assessed.
  - An annealing strategy is used to gradually enable the 2D prior, and various experiments are conducted to assess the decaying schedule. However, it is unclear whether activating the 2D prior in a binary way and prolonging the 3D phase would be more effective.
- Experiment setup
  - While the paper mentions the collection of additional datasets, there is limited information provided about the customized C10 dataset. It is essential to know how many objects are in this dataset and how the 100 views are distributed to assess its representativeness.
  - The significance of the RealFusion15 results is questionable because only 15 objects were evaluated. The paper should consider conducting experiments on a larger scale, similar to RealFusion on the CO3D dataset [Reizenstein et al., 2021], or Zero-1-to-3 on Google Scanned Objects [Downs et al., 2022] and RTMV [Tremblay et al., 2022], where evaluations involve more than 1000 objects.
- Unclear descriptions of the methodology
  - Reference-view reconstruction on the input image is described, but it is unclear at which stage it is applied.
  - Normal maps are shown in Figure 3, yet the text states Consistent123 uses masks.
- Unclear descriptions of the experiments
  - It is unclear whether the results are evaluated on the reference image (input) or on novel views.
  - It is unclear which dataset the ablation in Table 2 is performed on.
References:
Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common Objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In Proc. CVPR, 2021.
Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google Scanned Objects: A high-quality dataset of 3D scanned household items. In ICRA, 2022.
Jonathan Tremblay, Moustafa Meshry, Alex Evans, Jan Kautz, Alexander Keller, Sameh Khamis, Charles Loop, Nathan Morrical, Koki Nagano, Towaki Takikawa, and Stan Birchfield. RTMV: A ray-traced multi-view synthetic dataset for novel view synthesis. In ECCVW, 2022.
Questions
- Can the authors provide additional insight into the effectiveness of CLIP-based termination and its potential drawbacks? Can the authors provide additional insight into the importance of the annealing strategy?
- Can the authors provide additional details on the C10 dataset and justify the significance of the evaluation results?
- Can the authors clarify the confusing parts of the method as well as the experiments?
- Misc:
- Page 2, it claims that “with 3D structure priors, … (prior works) avoid multi-face issues, but struggle to obtain consistent reconstruction”. In my opinion, these are the same thing, which can be broadly categorized as “content drifting”.
Thank you for your time and helpful feedback. We respond below to your questions:
Q: Can the authors provide additional insight into the effectiveness of CLIP-based termination and its potential drawbacks?
Thank you for the suggestion to conduct more experiments to fully assess the effectiveness of CLIP-guided termination. We follow the setting of the simple baseline you mentioned, with the fixed number of training iterations for the 3D phase set to 3000. The results on RealFusion15 and C10 are shown in the following tables.
| RealFusion15 | CLIP-similarity | PSNR | LPIPS |
|---|---|---|---|
| origin | 0.844 | 25.682 | 0.056 |
| fix3D-3000 | 0.813 | 25.604 | 0.068 |
| C10 | CLIP-similarity | PSNR | LPIPS |
|---|---|---|---|
| origin | 0.770 | 25.327 | 0.054 |
| fix3D-3000 | 0.746 | 25.066 | 0.063 |
We agree that monitoring the change of the CLIP rate can get stuck at local minima, and this is the main drawback of CLIP-based termination. To mitigate the problem of local minima, we introduce a sliding window that takes into account the CLIP rate of change over a certain time interval. As for the concern that "the change of CLIP-rate may exhibit large variations across object categories", this is exactly why we propose the case-aware detection mechanism. We also conduct additional experiments on the threshold and sliding-window size. The results:
| RealFusion15 | CLIP-similarity | PSNR | LPIPS |
|---|---|---|---|
| origin | 0.844 | 25.682 | 0.056 |
| | 0.812 | 25.464 | 0.068 |
| | 0.814 | 25.272 | 0.070 |
| | 0.817 | 25.528 | 0.069 |
| | 0.815 | 25.607 | 0.068 |
| C10 | CLIP-similarity | PSNR | LPIPS |
|---|---|---|---|
| origin | 0.770 | 25.327 | 0.054 |
| | 0.748 | 25.651 | 0.065 |
| | 0.753 | 25.151 | 0.065 |
| | 0.747 | 25.506 | 0.060 |
| | 0.753 | 25.398 | 0.066 |
Q: Can the authors provide additional insight into the importance of the annealing strategy?
We conducted an experiment in which we activate the 2D prior in a binary way and prolong the 3D phase. The results are shown in the tables below.
| RealFusion15 | CLIP-similarity | PSNR | LPIPS |
|---|---|---|---|
| origin | 0.844 | 25.682 | 0.056 |
| binary2D | 0.813 | 24.836 | 0.070 |
| C10 | CLIP-similarity | PSNR | LPIPS |
|---|---|---|---|
| origin | 0.770 | 25.327 | 0.054 |
| binary2D | 0.730 | 25.562 | 0.067 |
Q: Can the authors provide additional details on the C10 dataset and justify the significance of the evaluation results?
C10 consists of 100 Internet images collected from 10 categories. The C10 images were not obtained by rendering 3D objects. The main purpose of our evaluation on C10 is to assess the method's ability to generalize over different categories of objects.
Q: Can the authors clarify the confusing parts of the method as well as the experiments?
Q: Reference-view reconstruction on the input image is described, but it is unclear at which stage it is applied.
Specifically, in the code implementation, in one out of every four iterations (in either Stage 1 or Stage 2), a reconstruction of the reference viewpoint is performed by reducing the gap between the rendering at the input-image viewpoint and the input image, as sketched below.
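A minimal sketch of this interleaving, using hypothetical helper names (`render_view`, `reconstruction_loss`, `sds_loss`) rather than the authors' actual implementation:

```python
# Sketch of the interleaving described above; helper names are hypothetical.
for step in range(num_steps):
    if step % 4 == 0:
        # Every fourth iteration: reconstruct the reference viewpoint by
        # penalizing the gap between the rendering at the input-image pose
        # and the input image itself.
        rendered = render_view(nerf, reference_pose)
        loss = reconstruction_loss(rendered, reference_image)
    else:
        # Remaining iterations: optimize a randomly sampled novel view with
        # the diffusion prior(s) of the current stage.
        pose = sample_random_pose()
        rendered = render_view(nerf, pose)
        loss = sds_loss(rendered, stage=current_stage)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```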
Q: Normal maps are shown in Figure 3, yet the text states Consistent123 uses masks.
We sincerely apologize for our mistake. Indeed, we do not utilize the normal map, and in the revision, we have replaced the normal map with the mask image.
Q: It is unclear whether the results are evaluated on the reference image (input) or on novel views.
For quantitative evaluation, we adopt three metrics: CLIP-similarity, PSNR, and LPIPS. Due to the absence of GT 3D models, the PSNR and LPIPS results are evaluated only on the reference image (input). We compute CLIP-similarity as the average CLIP similarity between the rendered images of novel views and the reference view.
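A minimal sketch of how such a metric could be computed, assuming a generic `clip_encode` image encoder (a hypothetical helper mapping a batch of images to CLIP embeddings), not the authors' exact evaluation script:

```python
import torch
import torch.nn.functional as F

def clip_similarity(clip_encode, reference_image, novel_view_renders):
    """Average cosine similarity in CLIP space between each rendered novel
    view and the reference image. `reference_image` is a (C, H, W) tensor;
    `novel_view_renders` is a list of (C, H, W) tensors."""
    ref = F.normalize(clip_encode(reference_image.unsqueeze(0)), dim=-1)    # (1, D)
    views = F.normalize(clip_encode(torch.stack(novel_view_renders)), dim=-1)  # (N, D)
    return (views @ ref.T).mean().item()
```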
Q: It is unclear which dataset the ablation in Table 2 is performed on.
We randomly selected 5 objects from each category in the RealFusion15 and C10 image sets as the data for the ablation experiments in Table 2. We have made this clear in the revision.
Misc: Page 2, it claims that “with 3D structure priors, … (prior works) avoid multi-face issues, but struggle to obtain consistent reconstruction”. In my opinion, these are the same thing, which can be broadly categorized as “content drifting”.
Thank you for the correction. The sentence you mentioned is indeed poorly expressed. We have modified it in the revision to read: "With 3D structure priors, Liu et al. (2023) and Qian et al. (2023) can stably recover the 3D structure of an object, but struggle to obtain highly consistent reconstruction."
I really appreciate the feedback from the authors. Some of the experiments, particularly about the CLIP guidance design choices indeed clear some of the confusion. In future work, a more principled way of obtaining the threshold values and moving average window length may be needed.
However, my main concern is still about the evaluation setting and whether Consistent123 indeed surpasses the previous methods and brings additional insights to the community. I believe Reviewer qhyn also agrees with me on this:
- Based on the authors' comments, the CLIP-similarity between the reference view and the rendered views is the only proxy for whether the 3D content is faithfully recovered. Such a setting is inherently not meaningful for more oblique viewpoints deviating from the input viewpoint, which are important for evaluating 3D content quality.
- Given the limited variability and the small number of objects used for evaluation, the authors collected a dataset, C10, containing 100 objects from the internet. However, these objects were neither checked against Objaverse nor do they contain multiple views of the same object. Thus, C10 does not really mitigate the concerns about proper evaluation.
As I suggested in my original feedback, evaluating the approach on data with multi-views of the same object and reporting the PSNR/LPIPS for the held-out views is a more meaningful way to report the results. I am not quite convinced that the current results are reliable.
We concur with your perspective that solely evaluating the differences between predicted values and ground truth in reference views does not adequately measure reconstruction quality for image-to-3D tasks. Consequently, following the approach of Zero-1-to-3, we have chosen Chamfer Distance and volumetric IoU as metrics that directly reflect the gap between two meshes. We calculate these metrics on the Google Scanned Objects (GSO) 3D dataset to compare the generated meshes against the ground-truth meshes in the dataset. This comparison reflects the overall reconstruction quality of our method relative to previous image-to-3D approaches.
It's noteworthy that in the cases randomly selected from the GSO dataset, our method outperforms both Zero-1-to-3 and Magic123 in terms of Chamfer Distance and volumetric IoU. The detailed experimental results are as follows:
| Metric | Magic123 | Zero123 | Consistent123 |
|---|---|---|---|
| Chamfer Distance ↓ | 0.109916 | 0.109347 | 0.037091 |
| volumetric IoU ↑ | 0.351257 | 0.352527 | 0.565129 |
Instead of calculating the gap in image-quality metrics between multi-view images, we directly measure the gap between the generated and ground-truth values at the mesh level, which allows for a more intuitive and comprehensive evaluation of our approach against other image-to-3D work. We hope this result helps you assess the quality of our method's results.
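A minimal sketch of the two mesh-level metrics, assuming the predicted and ground-truth meshes have already been aligned and converted to sampled point clouds and boolean occupancy grids (mesh processing omitted); this is not the authors' evaluation code:

```python
import numpy as np

def chamfer_distance(p1, p2):
    """Symmetric Chamfer distance between two point clouds of shape (N, 3)
    and (M, 3); a brute-force sketch suitable for a few thousand points."""
    d = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def volumetric_iou(occ1, occ2):
    """Volumetric IoU of two boolean occupancy grids of identical shape,
    e.g. obtained by voxelizing the aligned predicted and GT meshes."""
    inter = np.logical_and(occ1, occ2).sum()
    union = np.logical_or(occ1, occ2).sum()
    return inter / union
```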
Thank you for the response. Having the Chamfer distance and volumetric IoU indeed makes the evaluation stronger. For the GSO dataset, which objects are selected for evaluation? Is there a reason not all objects are used?
Other than the 3D metric evaluation, I still think it is important to evaluate PSNR on held-out views if possible. 3D content creation, after all, is not only about geometry but also texture prediction.
The paper introduces a two-stage framework, called Consistent123, to achieve highly detailed and consistent 3D reconstruction from a single image. The proposed method utilizes 2D and 3D generative priors, i.e., the SDS losses of Stable Diffusion and Zero123, to realize consistent 3D reconstruction from a single image. Qualitative and quantitative experiments demonstrate the superiority of the proposed method over other 3D generation methods.
Strengths
- The paper provides detailed information and evaluations on Consistent123, allowing readers to gain a deeper understanding of the framework and its implementation.
- The organization is very well-structured and includes more details on the network architecture, and algorithm. It is very easy to follow. The authors provide clear and detailed explanations of the concepts, technical points, and evaluations, ensuring that readers can follow and reproduce the framework effectively.
Weaknesses
- For the proposed method, how are the 'multi-head' artifacts avoided, given that the main results do not show the back face of the object (e.g., in Figure 5)? If the method can handle such cases, please provide some analysis or additional evaluation to support this. Otherwise, please make the limitation clearer and show some failure cases, which are important for readers and for the future direction of the field.
- For Equation 3, how are the weights of the different losses determined? L_depth is not defined; if the proposed method uses depth as supervision, please explain how accurate depth is obtained.
- For Section 3.3, the name 'dynamic prior' is confusing; I suggest changing it to 'adaptive' prior since Equation 8 conveys the 'adaptive' meaning. Regarding this novel point, I am very curious about the weight changes during optimization; a visualization would make Section 3.3 easier to understand. By the way, what is the difference from Magic123 on this point? It is very important that the rebuttal provides an explanation.
- What are the running times for generating one 3D shape in Stage 1 and Stage 2?
- For the ablation studies, depth supervision should also be evaluated, as well as different weight combinations.
- In the experiments, more datasets should also be evaluated, such as NeRF4 (proposed in Magic123) and Objaverse (https://objaverse.allenai.org/). Evaluating on only two datasets is weak for a strong 3D generation method.
Questions
See weaknesses.
Overall, the paper proposes a novel framework to achieve consistent 3D shape generation with a two-stage optimization strategy, and the evaluation and validation provide evidence supporting the superiority of the proposed method. However, there are still some unclear issues, weak novelty, and insufficient evaluations (listed as weaknesses). Among these, the major concern is the weak novelty (the dynamic prior is very similar to Magic123), so I lean toward a borderline score for the submission and look forward to the response to the above questions.
Details of Ethics Concerns
N/A
Q1: Regarding the display of results for the multi-head issue you mentioned, please refer to Section A.1 in the Appendix. There, we have detailed the multi-view sampling results of the Realfusion15 and C10 dataset tests, demonstrating how our algorithm effectively avoids the multi-head problem by integrating 3D and 2D priors.
As mentioned in our paper's introduction, we observed the effects and final outcomes of the optimization process using pure 3D and 2D priors. We believe that the need for 3D and 2D priors varies at different timesteps during representation optimization. Hence, in the early optimization stages, where structural guidance is more crucial, we use pure 3D priors to ensure the 3D assets acquire correct geometric information (effectively resolving the multi-head issue). Following this, we employ a pretrained image-text model to assess if the structure is sufficiently optimized. Once stability in the assessment indicators is achieved, we gradually increase the 2D priors that offer high-resolution detail textures, thereby providing the most needed information at each optimization step.
Q2: In Equation 3, we used weight proportions for the different components following the default ratios in the stable-dreamfusion framework (λ_rgb=1000, λ_mask=500, λ_depth=10). We also supplemented our paper with experimental results for different orders of magnitude in parameter swapping and for reconstruction without the depth prior (λ_rgb=1000, λ_mask=500, λ_depth=0; λ_rgb=10, λ_mask=1000, λ_depth=500; λ_rgb=500, λ_mask=10, λ_depth=1000).
| Configuration | CLIP-similarity | PSNR | LPIPS |
|---|---|---|---|
| Default: 1000, 500, 10 | 0.844 | 25.682 | 0.056 |
| No Depth: 1000, 500, 0 | 0.812 | 25.097 | 0.069 |
| 10, 1000, 500 | 0.763 | 18.597 | 0.174 |
| 500, 10, 1000 | 0.701 | 16.961 | 0.206 |
| 800, 600, 400 | 0.795 | 23.288 | 0.085 |
We used the same depth estimator as the stable-dreamfusion framework (the EPFL-VILAB Omnidata models; see their GitHub repository). This tool is also integrated into stable_dreamfusion/preprocess_image.py.
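For reference, a minimal sketch of a weighted reference-view loss in the spirit of Equation 3 with the default weights quoted above; this is a schematic reconstruction under stated assumptions, not the authors' exact implementation. The depth term follows the normalized negative Pearson correlation formulation mentioned elsewhere in this discussion:

```python
import torch
import torch.nn.functional as F

def reference_view_loss(pred_rgb, gt_rgb, pred_mask, gt_mask,
                        pred_depth, est_depth,
                        lambda_rgb=1000.0, lambda_mask=500.0, lambda_depth=10.0):
    """Sketch of a weighted reference-view reconstruction loss. The depth term
    is a normalized negative Pearson correlation against the monocular depth
    estimate, restricted to the foreground region."""
    loss_rgb = F.mse_loss(pred_rgb, gt_rgb)
    loss_mask = F.mse_loss(pred_mask, gt_mask)

    # Standardize depths inside the mask, then take negative correlation.
    p = pred_depth[gt_mask > 0.5].flatten()
    d = est_depth[gt_mask > 0.5].flatten()
    p = (p - p.mean()) / (p.std() + 1e-6)
    d = (d - d.mean()) / (d.std() + 1e-6)
    loss_depth = -(p * d).mean()

    return lambda_rgb * loss_rgb + lambda_mask * loss_mask + lambda_depth * loss_depth
```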
Q3: We appreciate your attention to the "dynamic prior" used in our second stage. The "adaptive" process you mentioned is not reflected in Equation 8. The design around Equation 8 primarily aims to gradually increase the weight of the 2D prior, which provides rich texture information, and decrease the weight of the 3D prior, which offers structural information, once the 3D assets have acquired geometric information. The adaptive process occurs during the first stage of optimization using pure 3D priors, where the CLIP model assesses the optimization boundary through images sampled from 8 different perspectives. Because cases vary in geometric and texture complexity, the first stage, which primarily optimizes geometry, requires a different number of timesteps for each case to acquire sufficient structural guidance. Thus, the Optimization Boundary Judgement (OBJ) is an adaptive process that detects whether the 3D assets have gained adequate geometric information. A visualization of the weight changes during the optimization is presented in Fig. 11 in the Appendix. Our work, similar to Magic123, optimizes representations by merging 2D and 3D priors. However, a core difference from Magic123 is that we do not use a fixed ratio of 2D and 3D priors throughout. We divide the optimization into two stages: (1) in the first stage, we use pure 3D priors (see Equation 4) to quickly enable NeRF to acquire high-quality geometric information; (2) we use the Optimization Boundary Judgement to assess whether the assets have received sufficient structural information from the 3D prior; (3) when the OBJ determines that the asset is sufficiently structured, we follow Equation 8 to gradually reduce the weight of the 3D prior in the optimization target while increasing the weight of the 2D prior. This ensures that the 2D prior, which provides richer texture information, gradually dominates the optimization target while geometric accuracy is maintained.
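A minimal sketch of how such a CLIP-rate-based stage transition could be implemented; the window size and threshold below are placeholders, not the paper's settings:

```python
from collections import deque

class OptimizationBoundaryJudgement:
    """Sketch of the case-aware stage-transition check described above: track
    CLIP similarity over a sliding window and switch to Stage 2 once its rate
    of change stays below a threshold. Parameter values are placeholders."""

    def __init__(self, window_size=10, threshold=1e-3):
        self.history = deque(maxlen=window_size)
        self.threshold = threshold

    def should_transition(self, clip_score):
        self.history.append(clip_score)
        if len(self.history) < self.history.maxlen:
            return False
        # Average absolute change of the CLIP similarity across the window.
        scores = list(self.history)
        diffs = [abs(b - a) for a, b in zip(scores, scores[1:])]
        return sum(diffs) / len(diffs) < self.threshold
```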
Q4: For most cases, each of the two stages takes about 15 minutes, totaling approximately half an hour.
Q5: The experimental results for the depth comparison can be found in Q2.
Q6: We greatly appreciate your valuable suggestions regarding the experiments. Regarding the Objaverse dataset you mentioned, it was used to train the Zero123 model (the 3D prior we use), making it unsuitable for testing purposes. Therefore, we have supplemented our appendix with additional experimental results on the NeRF4 dataset (see Figure 10 in the Appendix). We could not obtain reasonable reconstructions on NeRF4, but these are also failure cases for all the compared methods.
Dear authors,
Thanks for your detailed response and efforts in correcting the paper. Some of my concerns are addressed, and these revisions are encouraged to be added to the paper for clarity. However, regarding the dynamic priors, I still think the approach is similar to Magic123 and has limited novelty. Is there any difference that I am missing?
Best,
Thank you for your feedback. We will incorporate the experimental details and supplementary explanations into our paper. Addressing your concern about the similarity between our dynamic prior approach and Magic123, the key difference lies in how the priors are utilized.
Magic123 consistently uses a fixed ratio of 2D and 3D priors for all cases. We believe this approach does not adequately cover all scenarios, as some cases require more structural information due to their complex nature, while others demand higher quality texture guidance for their intricate textures.
In contrast, our method differs significantly from Magic123. The ratio of 3D to 2D priors in our approach dynamically changes, tailored to the needs identified in our experiments. We observed that structural information is more crucial in the early stages of representation optimization, while high-quality texture guidance becomes more vital in the later stages.
Consequently, we are confident that our method of progressively adapting the prior ratios can optimize 3D assets with more accurate structures and consistent textures, compared to the fixed approach of blending 3D and 2D priors.
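To make the contrast concrete, here is a minimal sketch of a fixed prior ratio versus a progressively shifting one; the decay shape and all parameter values are illustrative placeholders rather than the paper's Equation 8 or Magic123's actual setting:

```python
import math

def fixed_weights(t, T, lambda_2d=0.4):
    """Fixed 2D/3D prior ratio, constant for every optimization step
    (placeholder value; schematic of the Magic123-style setting)."""
    return 1.0 - lambda_2d, lambda_2d              # (w_3d, w_2d)

def progressive_weights(t, T, decay=1.5):
    """Progressive schedule: the 3D prior dominates early, then its weight
    decays so the 2D prior gradually takes over (decay shape is assumed)."""
    w_3d = math.exp(-t / (decay * T))
    return w_3d, 1.0 - w_3d                        # (w_3d, w_2d)

# Schematic combined gradient at step t of Stage 2:
#   g = w_3d * grad_sds_3d + w_2d * grad_sds_2d
```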
The paper proposes Consistent123, a method for single image 3D reconstruction with text prompts. It is a follow-up work of Magic123 (Qian et al., 2023). To optimize novel views, it uses a two-stage case-aware optimization process. Stage 1 optimizes NeRF with only a 3D prior, i.e., the SDS loss of Zero 1-to-3 (Liu et al., 2023). An adaptive detection mechanism determines when to transition to Stage 2. Stage 2, like Magic123, brings in a 2D prior (the SDS of Stable Diffusion) for texture details, but with scheduling of the 3D-2D ratio over time. Evaluated on RealFusion15 (15 images) and a self-collected C10 dataset (100 images from 10 categories), Consistent123 seems to produce better results than Magic123.
Strengths
It introduces three major engineering tricks for the optimization: two-stage, CLIP-based stage transition, and diffusion prior ratio scheduling.
- Two-stage optimization helps mitigate the multi-face/Janus issue while maintaining good texture.
- CLIP-based boundary judgment makes the stage transition automatic and case-aware.
- Diffusion prior ratio scheduling avoids the manual ratio trade-off in Magic123.
Weaknesses
- Trivial novelty in the method. Magic123 proposes to trade off the weights of two SDS losses. This work basically fine-tunes the trade-off process with a stage split and scheduling. It is not technically novel.
- Insufficient quantitative evaluation. No 3D evaluation is performed to show the overall structural quality. The paper could have rendered some synthetic and scanned 3D meshes for evaluation. PSNR and LPIPS only reflect how much the input/reference view is overfitted in the NeRF. For example, RealFusion cannot generate reasonable geometry but beats Magic123 by a large margin in these two metrics according to Table 1.
- Lack of comparisons to some related work. Shap-E (Jun et al., 2023) and One-2-3-45 (Liu et al., 2023) are two SOTA papers in image-to-3D, released earlier than Magic123. However, they are not cited and compared in this paper.
- Additional prior, unfair comparison: it also needs text prompts as input. The paper does not specify how text prompts are prepared for the experiments. However, according to Figure 3, "two donuts" as the text prompt is fairly specific and introduces an important additional prior for the optimization. This leads to unfair comparisons with other methods.
- Potentially increased time cost. The paper does not report the expected time cost for the optimization.
- The pipeline is engineered to favor 3D prior at the beginning. It may bring some improvements on cherry-picked examples but might deteriorate in other cases.
Questions
- The authors fail to acknowledge other previous work in Sec. 3 (Methodology). This is not proper and hinders readers from seeing how ideas are inherited.
  a. Sec. 3.1 follows Make-it-3D and Magic123 in adding the normalized negative Pearson correlation depth loss in addition to the common color and mask losses, but the off-the-shelf depth estimator appears to have been changed.
  b. Sec. 3.2 uses the 3D prior loss (Eq. 4) proposed in Stable DreamFusion's implementation of Zero 1-to-3; it was later adopted in Magic123.
  c. Sec. 3.3 uses the 2D prior loss (Eq. 7) from DreamFusion. The combined loss (Eq. 8) follows Magic123 but changes the coefficient to a timestep-based one.
- In the explanation of the 3D prior (Eq. 4), the authors write that "R and T mean the positional coordinate parameters of the camera." This is an unprofessional mistake: R determines the orientation of the camera, not its location.
- Sec. 1 (Introduction): in fact, Magic123 cannot "avoid the multi-face issues." There are even some multi-face shapes on its webpage, such as the multi-head dragon and the multi-beak bird.
- Objaverse (Deitke et al., 2022) is used to train the important 3D prior (Zero 1-to-3) in this paper, but it is not cited.
- There is a limitation section but no qualitative examples of failure cases. Results that are not preferred over other methods' (e.g., Magic123's) should also be shown.
- All reference images seem to assume zero elevation. It is quite common for optimization methods to fail on non-zero-elevation reference images, even when the elevation is given.
The paper was written with many grammatical mistakes, to list a few:
- Abstract: can exhibits → can exhibit
- Page 2 bottom: dataste → dataset.
- Either “on Realfusion” or “on the Realfusion dataset,” but not “on Realfusion dataset.”
- Page 3 top: an case → a case
- Sec. 3.1: The design employ → The design employs
Thank you for your detailed and constructive feedback. We have carefully considered each of your comments and have made efforts to improve our paper accordingly.
Q1: In response to your concern about not introducing previous work in Sec. 3 (Methodology) to show how our ideas developed: we believe a thorough review of the field and the positioning of the problem we solve have already been addressed in the introduction and related work sections. a. Regarding the use of the depth prior, our algorithm's implementation is indeed based on both the Magic123 and stable-dreamfusion frameworks. The depth estimator we used is primarily the one integrated in stable-dreamfusion, available at [GitHub Link]. We will modify Sec. 3.1 of the original text to reflect this. b. Our algorithm, like Magic123, is implemented on the stable-dreamfusion framework. As you mentioned, the 3D prior loss proposed by Zero123 and the 2D prior loss based on Stable Diffusion are both integrated into this framework. c. Considering your concerns about similarities with Magic123, we would like to re-emphasize our differences: (1) in the first stage, we use pure 3D priors (see Equation 4) to optimize NeRF for rapid acquisition of high-quality geometric information; (2) we employ Optimization Boundary Judgement (OBJ) to assess whether the optimized assets have received sufficient structural information from the 3D prior; (3) when the OBJ determines that the asset is adequately structured, we gradually reduce the weight of the 3D prior in the optimization target and increase the weight of the 2D prior according to Equation 8. Beyond both using 3D and 2D priors, the way our method applies these priors throughout the optimization process differs distinctly from Magic123.
Q2: You are correct that our initial explanation of the R and T parameters could lead to ambiguity. The R parameter represents the camera's orientation in the form of a rotation matrix, while the T parameter represents the camera's translational movement in the coordinate system. Therefore, we have revised our original statement from “R and T mean the positional coordinate parameters of the camera” to “R and T represent the rotation and translation parameters of the camera.”
Q3: Magic123's use of 3D priors does help mitigate the Janus problem associated with purely 2D-based methods to some extent, but it still fails to adequately cover all cases, particularly those with complex geometric information. This phenomenon, to a certain degree, motivates our paper. One reason why dependence on 3D priors still results in the Janus problem is the lack of sufficient 3D prior guidance during the geometric initialization phase. Our method ensures sufficient geometric information is acquired through Optimization Boundary Judgement before gradually integrating 2D prior information.
Q4: We have decided to adopt your suggestion regarding additional citations and comparative experiments. As shown in Fig. 10, we could not obtain reasonable reconstruction results for the images in NeRF4, but this is also a failure case for all the methods compared.
Q5: As shown in Figure 10 in the Appendix, we could not obtain reasonable reconstruction results for the images on NeRF4, but these are also the failure cases for all the methods compared.
Q6: We agree that the elevation of the input image will affect the reconstruction quality. Like DreamGaussian, we will adopt the elevation estimation module to get the elevation information, which is later added into the textual description.
We have also addressed the grammatical errors you pointed out in the revision.
Currently, there are no responses to the weaknesses mentioned in the review.
Weakness 1
Looking back at previous work on 3D reconstruction, some methods rely purely on 2D priors as guidance, while others use a 3D diffusion model to generate new perspectives. Even though Magic123 proposed using both 2D and 3D priors, it is challenging to combine the two effectively. The core of our work is a more rational optimization based on the need for guidance during the representation training process. We believe our method of staged, OBJ (Optimization Boundary Judgement)-based optimization with gradual injection of 2D priors yields more consistent 3D assets.
Weakness 2
Considering your mention of the overall quality assessment of 3D meshes, we supplemented our paper with reconstruction metrics computed on samples from the Google Scanned Objects 3D dataset. We selected Chamfer Distance and volumetric IoU (the same as Zero-1-to-3) to measure the distance between the generated mesh and the ground-truth mesh, and compared our method with Magic123 and Zero123. The quantitative reconstruction results are as follows:
| Metric | Magic123 | Zero123 | Consistent123 |
|---|---|---|---|
| Chamfer Distance ↓ | 0.109916 | 0.109347 | 0.037091 |
| volumetric IoU ↑ | 0.351257 | 0.352527 | 0.565129 |
These evaluations on mesh further validate the superiority of our method.
Weakness 3
We conducted comparative experiments with Shap-E (Jun et al., 2023) and One-2-3-45 (Liu et al., 2023), testing separately on the R15 and C10 datasets. The results are shown in the following tables:
One-2-3-45/C10+R15 Dataset
| One-2-3-45 | Clip ↑ | PSNR ↑ | LPIPS ↓ |
|---|---|---|---|
| C10 | 0.673 | 14.081 | 0.277 |
| R15 | 0.679 | 13.754 | 0.329 |
Shap-E/C10+R15 Dataset
| Shap-E | Clip ↑ | PSNR ↑ | LPIPS ↓ |
|---|---|---|---|
| C10 | 0.508 | 6.239 | 0.639 |
| R15 | 0.544 | 6.749 | 0.598 |
Consistent123/C10+R15 Dataset
| Consistent123 | Clip ↑ | PSNR ↑ | LPIPS ↓ |
|---|---|---|---|
| C10 | 0.770 | 25.327 | 0.054 |
| R15 | 0.844 | 25.682 | 0.056 |
Weakness 4
We apologize for not clearly stating in the paper's pipeline that Consistent123 uses the same method as Make-it-3D (Tang et al., 2023) to obtain text, i.e., feeding the image into the caption model BLIP-2 (Li et al., 2023) to produce the description used with the 2D prior in the second stage. Additionally, apart from the caption model, we can use textual inversion (Gal et al., 2022) on the input image to obtain specific tokens for the image. Comparative experiments between the caption model and textual inversion can be found in Appendix Section A.2.
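For illustration, a minimal sketch of obtaining a text prompt with BLIP-2 via Hugging Face transformers; the checkpoint name, file path, and decoding settings are illustrative assumptions, not necessarily the authors' configuration:

```python
# Sketch: caption the reference image to get a text prompt for the 2D prior.
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("reference.png").convert("RGB")  # hypothetical input path
inputs = processor(images=image, return_tensors="pt")
caption = processor.batch_decode(
    model.generate(**inputs, max_new_tokens=30),
    skip_special_tokens=True,
)[0].strip()
print(caption)  # used as the text prompt in Stage 2
```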
Weakness 5
We are glad you highlighted the issue of optimization time. Through thorough experimentation, we verified that optimizing the first stage with OBJ and the second stage with both priors does not significantly affect the optimization duration. In most cases, each stage is about 15 minutes long.
Weakness 6
Addressing your concern that our method's initial bias towards 3D priors may improve quality on some cases but generalize poorly on others, we explain from two aspects. Firstly, for cases where optimization with only 2D priors would lead to the Janus problem, Consistent123 uses 3D priors with correct geometric structure for sufficient structural initialization. Secondly, for assets generated using only 3D priors, which have lower texture quality, our method later provides higher-resolution 2D priors to compensate for the loss of texture detail caused by using only 3D priors.
As mentioned in Section 3.3, we believe both priors have complementary characteristics, and we progressively change them based on the representation's need during the optimization process to achieve higher quality assets. The showcased cases in our experiments were also selected from commonly seen categories in daily life to demonstrate the generalizability of our method.
This paper proposes a two-stage method for 3D asset reconstruction from one image with both 2D and 3D diffusion priors. In stage 1, only 3D prior is used. In stage 2, both 3D prior and 2D prior are dynamically combined. A CLIP-based similarity score changing rate determines the transition to stage 2. Qualitative and quantitative experiments show better texture and 3D consistency of the assets generated by the proposed method.
Strengths
- The paper illustrates a good trade-off strategy for combining the 3D and 2D priors during optimization.
- The experiments show good qualitative and quantitative results, which demonstrate the superiority of the proposed method in generating assets with better texture and 3D consistency.
- The paper is well-written and easy to follow.
Weaknesses
- Stage 1 and Stage 2 could potentially be merged into one stage by using a dynamic mechanism that starts with a pure 3D loss and lets the weight on the 3D loss drop more slowly during the first several iterations (e.g., with a larger decay parameter). It would be interesting to see results comparing the proposed two-stage strategy with such a merged one-stage strategy using an optimal decay parameter.
- The proposed method needs descriptions of the reference image in Stage 2, while the comparison methods may not need such information.
- In the Optimization Boundary Judgement, a CLIP-based changing rate determines the transition to Stage 2. What if the changing rate of another straightforward metric, e.g., PSNR, is used instead of the CLIP score? Experiments are desired.
- Missing citation of Score Distillation Sampling (SDS) near Equation 7; "DreamFusion (Poole et al., 2023)" should be mentioned there. In addition, are Equations 4 and 7 losses or gradients?
Questions
Please check the questions in weaknesses. I will consider raising the rating if the authors can respond to the questions well in the rebuttal.
Thank you for your time and helpful feedback. We respond below to your questions:
Weaknesses 1
We agree that, with a dynamic mechanism and a proper decay parameter, the two stages can be merged into a single one. Specifically, the merged one-stage variant reconstructs the objects using the dynamically weighted objective throughout, with the decay parameter set to 1.5. The results on RealFusion15 and C10 are shown in the following tables.
| RealFusion15 | CLIP-similarity | PSNR | LPIPS |
|---|---|---|---|
| origin | 0.844 | 25.682 | 0.056 |
| one stage | 0.825 | 24.812 | 0.068 |
| C10 | CLIP-similarity | PSNR | LPIPS |
|---|---|---|---|
| origin | 0.770 | 25.327 | 0.054 |
| one stage | 0.765 | 25.253 | 0.062 |
Weaknesses 2
We are sorry we did not specify the source of the text description. Except for Zero123, all other comparison methods use a description of the reference image, which is obtained using an image caption model. We do the same, but we apologize for not including the icon of the image caption model in the pipeline (Figure 3) to indicate the source of the textual description. We have fixed this in the revision.
Weaknesses 3
Thank you for your interest in our case-aware CLIP-based detection mechanism. We employ CLIP-similarity because it can serve as a metric for evaluating global structure: the distance between the textual description and the rendering of an arbitrary viewpoint can be computed in the CLIP latent space. If we opted for PSNR, we would calculate the PSNR between the renderings of the current step and the previous step from the same viewpoint, and the rate of change of this PSNR value only reflects the convergence of a single viewpoint; it cannot judge whether an object's structure is sufficiently initialized. If you could suggest metrics that enable assessing overall consistency across multi-view images, we would be delighted to explore these possibilities and present our findings.
Weaknesses 4
Thank you for the correction. We have added the citation in our revision. Regarding Equations 4 and 7, they are gradients, as described in DreamFusion. Our description of them as losses creates confusion, and we have corrected this in the revision.
Dear authors,
Thanks for your detailed response and efforts in correcting the paper. Some of my concerns are addressed.
For Weakness 1, have the authors tried other (larger) values of the decay parameter, e.g., 2, 3, or 5? Do they improve the one-stage performance?
Thank you for your reply! We are sorry that we did not conduct experiments with larger values. If the decay parameter is too large, the weight of the 3D prior stays high throughout the stage, which does not fit well with our proposed idea of gradually letting the 2D prior dominate later in the stage. For example, when the decay parameter equals 2, the weight of the 3D prior is still 0.606 at t = T and never even drops below 0.5. Since time and computational resources are not sufficient to complete these experiments now, we will certainly add them later and show them in the paper.