PaperHub
Rating: 6.0 / 10 · Poster · 4 reviewers (min 5, max 7, std 0.7)
Individual ratings: 7, 5, 6, 6
Confidence: 4.8 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.3
NeurIPS 2024

Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image

OpenReview · PDF
Submitted: 2024-05-12 · Updated: 2024-11-06
TL;DR

A novel high quality and efficient image-to-3D framework.

Abstract

Keywords

image to 3d, 3d generation, mesh generation

Reviews and Discussion

Review
7

This paper introduces a novel method to reconstruct 3D from a single image. The method is two-stage: (i) 4 images corresponding to the orthographic views are first generated along with normal maps, (ii) then, the multi-view maps are used to initialize and optimize a mesh using differentiable rendering techniques which are adapted to this sparse-view setting. This second stage contrasts with recent image-to-3D methods which typically train large feed-forward reconstruction models. Extensive comparisons are conducted on (i) samples from original papers for a qualitative evaluation and (ii) on the standard Google Scanned Objects (GSO) dataset for quantitative results. In both cases, the proposed approach performs significantly better than the presented baselines.

Strengths

[S1] Novelty. The proposed approach is interesting and novel in different aspects. The first stage aims at generating multi-view images and normal maps from a single image; this idea is not novel per se and was introduced/studied by prior works like Wonder3D. However, it is worth noting that it technically differs a bit from Wonder3D and there is an additional effort to generate high resolution outputs. The second stage about 3D mesh reconstruction using differentiable rendering is elegant, effective and rather novel. In particular, it contrasts with recent image-to-3D methods which typically train large feed-forward reconstruction models tailored for this task.

[S2] SOTA performances. For both the qualitative and quantitative evaluations, the proposed approach showcases results that are significantly better than prior works. Visually, the method not only generates more detailed textures but also more accurate geometry.

[S3] Sound experiments. The authors conducted extensive experiments that are sound and validate the proposed method. They not only compare the method to an exhaustive set of state-of-the-art competitors both qualitatively and quantitatively, but also conducted ablation studies on some method components.

Weaknesses

[W1] Unrigorous technical presentation. While the high-level overview of the approach is clear, some parts of the technical presentation lack clarity and contain inconsistencies. For example:

  • Eq (1) looks wrong: a depth map can be obtained by integrating gradients derived from normals, not the normals themselves
  • Eq (5) is incoherent and lacks clarity: is $i$ an integer or an image? If $\mathcal{I}$ is a set of images, $i \in \mathcal{I}$ is an image, but $V_M(v, img)$ is undefined. The rightmost part of the equation, $\sum_i V_M(v, \mathcal{I})$, does not make sense either. This part needs revision: $i$ cannot simultaneously be an integer indexing the ground-truth views $\mathcal{I}_m$, a sample view image from $\mathcal{I}$, and an integer indexing the sample views $\mathcal{I}$. I would define this process for a single target view $I_t$ given the source views $I_s^1, \ldots, I_s^N$, and then define the loss in Eq (6) using the sum over the pseudo ground-truth target views.

This lack of rigour harms readability, and resolving these issues would greatly strengthen the paper.

[W2] Missing ablation studies. I would expect more ablation studies to better understand the impact of each component that differs from prior works. Currently, the two ablation experiments correspond to two very technical aspects: a regularization term (called Expansion) and the explicit target technique. Experiments assessing the impact of the following components are missing:

  • the image-to-multiview stage (e.g., compare results by replacing it with Wonder3D image-to-multiview stage)
  • the super-resolution (e.g., compare results with/without super-resolution network)
  • the multiview-to-3D stage (e.g., compare results by replacing it with Wonder3D multiview-to-3D stage)

Questions

Questions:

  • Eq (2): how is the ground-truth mask computed?
  • What happens when the input view does not correspond to the frontal orthographic view, e.g. say with elevation=20 and azimuth=45?
  • L171 what are the details behind edge collapse/split/flip?

Remarks:

  • the terms "wild image", "wild views" do not sound correct, the appropriate term would be "in-the-wild image" but this typically corresponds to random real-world images that one could find on internet or social medias, which is not the case here. I would suggest removing this aspect which is not necessary for the paper storyline
  • the term "ExplicitTarget optimization" is not crystal clear out of context, I would suggest finding another term that clearly conveys what is under the hood, e.g. visibility-aware supervision

Limitations

Yes, there is a brief limitation section. Including visuals illustrating failure cases would greatly help the readers better understand the model limitations.

Author Response

We sincerely thank you for your time and valuable comments, some of which are fundamental and in-depth suggestions that have helped us greatly improve our paper. To address your concerns, we present our point-by-point response below.

Comment 1: Unrigorous technical presentation

We appreciate your observation regarding the clarity of the notations and explanations in Equations (1) and (5). In response to your feedback, we will make the revisions in the updated manuscript to enhance understanding.

For Equation (1): here $d(i,j)$ denotes the value at coordinate $(i,j)$ in the depth map, the integration runs along the vertical line $y=j$, $\vec{n}(x)$ denotes the vector of the input normal field at position $x$, and $n_x$ is the component of the normal field $\vec{n}$ along the direction of the x-axis (a scalar function). The confusion might arise because the letter $x$ is used in two different contexts. To clarify, we will use the formula $d(i,j)=\sum_{t=0}^{i} n_x(t,j)$, where $d$ and $n_x$ are treated as matrices in the discrete version. This change aligns better with how the variables are handled in the code.
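For concreteness, a minimal Python sketch of this discrete integration follows; the array layout and function name are our assumptions, and the rest of the pipeline (e.g., the random-rotation averaging discussed later in this thread) is omitted.

```python
import numpy as np

def normals_to_depth(n: np.ndarray) -> np.ndarray:
    """Integrate the x-component of a predicted normal field into a coarse
    depth map, following d(i, j) = sum_{t=0}^{i} n_x(t, j).

    n: (H, W, 3) array of unit normals for one view.
    Returns an (H, W) depth map: a cumulative sum down each vertical line.
    """
    n_x = n[..., 0]                 # scalar x-component of the normal field
    return np.cumsum(n_x, axis=0)   # integrate along each line with j fixed
```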

For Equation (5) (ExplicitTarget): let $\mathrm{Avg}(V, W) = \frac{\sum_i V_i W_i}{\sum_i W_i}$ denote the weighted average function, and let $V_{M}(v, i): (\mathbb{N}^+, \mathbb{N}^+) \rightarrow \{0, 1\}$ denote the visibility of vertex $v$ in mesh $M$ under view $i$. $Col_M(v,i)$ denotes the color of vertex $v$ in viewpoint $i$. We compute the ExplicitTarget $ET$ of each vertex in mesh $M$ as

$$ET_{M}(v) = \begin{cases} \mathrm{Avg}\big(Col_M(v,i),\; V_{M}(v, i)\, W_M(v, i)^2\big) & \text{if } \sum_{i} V_{M}(v, i) > 0 \\ \mathbf{0} & \text{otherwise,} \end{cases}$$

where $W_M(v, i) = -\cos(N_{v}^{(M)}, N_{i}^{(view)})$ is a weighting factor, $N_{v}^{(M)}$ is the vertex normal of $v$ in mesh $M$, and $N_{i}^{(view)}$ is the view direction of view $i$.

$ET_M$ computes the target with view-direction weighting so that a view does not introduce significant errors on surfaces that are steeply slanted relative to that view (predictions in those regions are usually inaccurate).
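To make the formula concrete, here is a minimal numpy sketch of this computation; the array layout and names (colors, vis, cos_w) are our assumptions, not the paper's code.

```python
import numpy as np

def explicit_target(colors, vis, cos_w):
    """ExplicitTarget per vertex, following the equation above.

    colors: (V, K, 3)  Col_M(v, i), color of vertex v seen from view i
    vis:    (V, K)     V_M(v, i) in {0, 1}, visibility of v in view i
    cos_w:  (V, K)     W_M(v, i) = -cos(vertex normal, view direction)
    Returns (V, 3) target colors; vertices visible in no view get 0.
    """
    w = vis * cos_w**2                              # visibility-masked squared weights
    num = (w[..., None] * colors).sum(axis=1)       # weighted color sum over views
    den = np.maximum(w.sum(axis=1, keepdims=True), 1e-8)
    et = num / den
    et[vis.sum(axis=1) == 0] = 0.0                  # unseen vertices fall back to 0
    return et
```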

Comment 2: Missing ablation studies

We appreciate the depth of your feedback. To address your concern, we will enrich our paper with the following additional ablation studies to elucidate the contributions of each component within our model:

(a) ISOMER Module Analysis: We will incorporate a comparative experiment of the ISOMER module to better demonstrate its superiority over existing reconstruction algorithms.

(b) Explicit Target Algorithm Analysis: We will show qualitatively how the ExplicitTarget algorithm improves the final performance.

(c) Robustness Analysis: We will quantitatively study the performance and differences of our method under non-frontal inputs.

(d) Resolution Impact Analysis: We will expand our study to include a qualitative comparison across various resolutions in order to demonstrate the differences between different resolutions.

Question 1. Thanks for your insightful question. The ground-truth mask is derived from the normal map predicted by the model: predicted pixels whose RGB magnitude is far from 1 are considered background.
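A minimal sketch of this masking rule (the threshold tol is our assumption; no value is stated above):

```python
import numpy as np

def mask_from_normal(normal_map: np.ndarray, tol: float = 0.2) -> np.ndarray:
    """Foreground mask from a predicted normal map: pixels whose RGB vector
    magnitude is far from 1 are treated as background.

    normal_map: (H, W, 3), values decoded to roughly unit-length normals.
    Returns an (H, W) boolean mask, True for foreground.
    """
    mag = np.linalg.norm(normal_map, axis=-1)  # per-pixel vector magnitude
    return np.abs(mag - 1.0) < tol             # close to 1 => foreground
```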

Question 2. Thanks for your insightful question. To resolve your concerns, we add a new test with rotated objects in Table 2 to test robustness in non-front-facing views. The test results show that our method still performs well in this case, and even the geometry prediction is more accurate.

Question 3. We appreciate your attention to this matter. In response, we will enhance the clarity and details of our explanation in the revised version as follows:

  • Edge Collapse: This operation is used to avoid and heal defects in the mesh. It contracts an edge so that its two endpoints merge into a single vertex, removing the two triangles incident to that edge. This process can help to eliminate narrow triangles that might be causing issues in the mesh, such as those that are too thin to accurately represent the surface they approximate. It prevents the creation of topological artifacts and maintains mesh quality.

  • Edge Split: This is the opposite of edge collapse. In an edge split, an edge that is longer than a specified maximum length is divided into two, creating a new vertex at the midpoint of the edge. This operation is used to refine the mesh, ensuring that the local edge length is kept close to the optimal length. It helps to maintain the quality of the mesh by avoiding edges that are too long, which could lead to an inaccurate representation of the surface.

  • Edge Flip: Edge flip is an operation that adjusts the connectivity of the mesh to improve its quality. It replaces the shared edge of two adjacent triangles with the edge connecting the two opposite vertices, effectively changing the triangulation of the mesh. This can help to keep the degree of the vertices close to their optimal value, which is typically six for internal vertices (or four for boundary vertices).

These operations aim to improve mesh quality, avoid defects, and ensure an accurate representation of the target geometry; a simplified sketch of the split/collapse criteria is given below.
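As a rough illustration of the split/collapse criteria (a sketch using the classic incremental-remeshing thresholds as an assumption, not the authors' implementation):

```python
import numpy as np

def remesh_candidates(verts: np.ndarray, faces: np.ndarray, target_len: float):
    """Find edges to split (too long) or collapse (too short).

    Uses the common incremental-remeshing thresholds of 4/3 and 4/5 of the
    target edge length; rebuilding the face list after splits, collapses,
    and valence-driven flips is omitted for brevity.
    """
    edges = np.vstack([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
    edges = np.unique(np.sort(edges, axis=1), axis=0)          # unique undirected edges
    lengths = np.linalg.norm(verts[edges[:, 0]] - verts[edges[:, 1]], axis=1)
    to_split = edges[lengths > 4.0 / 3.0 * target_len]         # refine long edges
    to_collapse = edges[lengths < 4.0 / 5.0 * target_len]      # remove short edges
    midpoints = 0.5 * (verts[to_split[:, 0]] + verts[to_split[:, 1]])
    return to_split, to_collapse, midpoints
```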

Remark 1. Thanks for pointing out the issue. We will modify the statement in our revision.

Remark 2. Thanks so much for your great suggestions. We are definitely considering adopting a name that is more straightforward and easily comprehended.

We appreciate your detailed comments. We will thoroughly revise our paper. Thanks again for your time and in-depth suggestions.

Comment

Thanks for the detailed answers. I have carefully read the rebuttal and my concerns were mostly addressed. One missing aspect is a discussion about failure cases and I strongly recommend adding some in the revised paper. In general, such an analysis is easy to build, it provides strong insights about the method performances and it can really drive the next iterations. Overall, I will keep my rating.

Comment

Dear Reviewer,

We sincerely appreciate your thoughtful feedback and the time you have taken to carefully review our rebuttal. We are pleased to hear that our detailed responses have addressed most of your concerns.

Regarding your suggestion to include a discussion on failure cases, we fully agree with the value such an analysis would bring to the paper. It not only enhances the robustness of our method but also provides crucial insights for future improvements.

In the revised version, we will incorporate a comprehensive analysis of failure cases. This will include:

  • Analysis of inconsistent predictions across views.
  • Generation of geometric structures in unseen areas.
  • The simple colorization algorithm.
  • Inaccurate predicted normal maps.

We believe that this detailed analysis will significantly strengthen our paper and contribute to the advancement of the field.

Once again, thank you for your constructive feedback. We look forward to incorporating these improvements and hope to meet your expectations in the revised manuscript.

Best regards, Authors

Comment

Sorry for the messed up Latex in Equation (1) above, here is a corrected version.

Here $d(i,j)$ denotes the value at coordinate $(i,j)$ in the depth map, the integration runs along the vertical line $y=j$, $\vec{n}(x)$ denotes the vector of the input normal field at position $x$, and $n_x$ is the component of the normal field $\vec{n}$ along the direction of the x-axis (a scalar function).

The confusion might arise because the letter $x$ is used in two different contexts. To clarify, we will use the formula $d(i,j)=\sum_{t=0}^{i} n_x(t,j)$, where $d$ and $n_x$ are treated as 2D arrays in the discrete version. This change aligns better with how the variables are handled in the code.

Comment

Dear Reviewer inWh:

Thanks for reviewing this work. Would you mind checking the authors' feedback to see whether it resolves your concerns, or do you have further comments?

Best, AC

Comment

Dear Reviewer,

Thank you for your patience and thorough review. We have addressed the issues you raised with detailed responses. We welcome you to engage in further discussion with us. Your insights are invaluable, and we are eager to clarify any points or provide additional information as needed.

Looking forward to your feedback and any further questions you may have.

Best regards, Authors

Review
5

The paper focuses on single-image-to-3D. Given a single image, it first finetunes Stable Diffusion Image Variations models to generate orthogonal multi-view RGB/normal images and ControlNet-Tile models to enhance resolution. Then, it proposes the instant and consistent mesh reconstruction algorithm (ISOMER). It has three stages: 1) use the estimated depth (from predicted normal maps) of the front and back views to initialize the geometry; 2) use differentiable rendering to optimize geometry with mask and normal losses; 3) optimize geometry and vertex colors to fit the view-weighted color, where each view's weight is the squared cosine between the vertex normal and the view direction. The total time required is claimed to be within 30 seconds. It compares with some recent approaches on 30 GSO objects and achieves better performance in terms of both visual and geometric quality.

Strengths

  • High-Resolution Appearance at Input View:

    It demonstrates improvements in the resolution of the generated appearances, especially at the input view. Unlike previous works that produced lower-resolution multi-view outputs, resulting in a lack of clarity and detail, this paper utilizes diffusion models to achieve effective super-resolution. This allows the generated views to maintain high resolution and exhibit more details (reimagined though).

  • Enhanced Geometry:

    The paper showcases impressive geometry, particularly in examples of garage kit figures. Technically, this work better leverages predicted normal map information to optimize geometry and color, surpassing previous models like Wonder3D, which also predicted normal maps. This enhanced utilization of normal maps carves more geometric details.

  • Many Details and Supporting Materials:

    It provides many network parameters, predicted meshes, and rendering videos, which are valuable for assessing the performance and effectiveness of the proposed methods.

Weaknesses

  • Limited Qualitative Results to Frontal View, Worse Side Views:

    The majority of qualitative results presented in the paper are focused on the frontal view, as it is prioritized by the approach. However, for non-frontal views, there are issues such as visible seams and artifacts. For example, the video in supplementary material shows several instances where characters have duplicated ears. Although the frontal view quality is enhanced, this improvement comes at the expense of other views, which suffer from more noticeable artifacts.

  • Potential Bottleneck by the Two-view Initialization of First Step

    Following the previous point, those artifacts are potentially influenced by the algorithm's initialization process, since only front and back views are used. Artifacts like dual ears, if introduced in the first step of ISOMER, may not be easily corrected by subsequent refinement. What if the input view is not exactly frontal and the problem is more ill-posed? The artifacts could be even more severe.

  • Narrow Evaluation Scope, Limited Geometry Complexity:

    The paper predominantly evaluates small cartoon figures and animal models in terms of qualitative evaluation, with limited representation of real-world objects encountered in daily life. Real-world objects may have more complex geometry, where four orthogonal views are not enough to cover. Even for simple objects like a mug, the method may fail to generate a solid bottom and an empty interior given that they are unseen.

  • Pose Sensitivity and Distortion Issues:

    The network outputs exhibit significant distortions when input images are not taken from a frontal view with the elevation equal to zero. For instance, a machine example in the supplementary video shows clear distortion when the input image is taken from an elevated side angle. This indicates a lack of robust data augmentation during training, leading to heavy restrictions on input poses.

  • Oversimplified Texture Handling of Occluded Region:

    The approach to unseen vertex coloring is overly simplistic, using flood-fill colors to interpolate occluded areas. This may result in incorrect color transitions for complex geometries with significant occlusions. Even for simple cases, the bottoms or tops of objects are often unseen from 4 orthogonal views. Vertex colors in those regions are roughly smoothed out.

  • Small Evaluation Set:

    The evaluation set consists of only 30 objects, which is insufficient, especially considering that none of these baselines is purely optimization-based and their per-image inference time is no more than a few minutes. A few hundred objects would be necessary. For example, GRM [63] uses 250 shapes for single-view reconstruction evaluation, providing a more robust and reliable assessment.

  • Unclear Writing:

    The paper contains several sections where the writing is unclear. Please refer to the points listed in the Questions.

  • Lack of Quantitative Ablation Studies:

    The paper does not include any quantitative ablation studies.

Questions

  1. Please properly cite, acknowledge, and compare with previous works if similar ideas are applied. For example, in the Multi-view Image Generation, two key points (L499-502) seem to be the same as ImageDream [15]. And, it also uses IP-Adapter (L515). A reference U-Net is also used in Zero123++ [46]. Please cite related work and explain the differences when introducing the proposed method. This helps readers understand how ideas are inherited and improved.
  2. No notations in Eq. 1 are explained. I can understand the point. But it is not reader-friendly. Please delete it or explain it. The formula itself is also over-simplified.
  3. Wrong references of figures, and algorithms. L181, L188, L557, etc.
  4. In Algo. 1, the variable cnt is useless and not explained. The set 'colored' is initialized as empty but assumed to be initialized by C (L13). The authors may consider putting it more clearly: for invisible vertex coloring, it applies the flood-fill algorithm, where each vertex is colored by the average color of the neighboring colored vertices. Use proper terms and make your writing concise.
  5. Is Expansion (L181) referring to applying normal-based loss L_normal?
  6. L204: the result for vertex v → the predicted color of vertex v. Make it more clear.
  7. Time cost of each ISOMER step.

Limitations

Many significant limitations of the approach are not discussed. Please refer to the Weaknesses.

Author Response

We sincerely thank you for your time and valuable comments, some of which are fundamental and in-depth suggestions that have helped us greatly improve our paper. To address your concerns, we present our point-by-point response below.

Comment 1: Limited Qualitative Results to Frontal View, Worse Side Views

Thanks for bringing this comment up, and we acknowledge the presence of this issue. Our investigation indicates that it stems from occasional inconsistencies in the multi-view predictions, which is a challenge inherent in all current multi-view reconstruction solutions. We are actively exploring ways to mitigate these inconsistencies to enhance the robustness and reliability of our reconstruction process.

Comment 2: Potential Bottleneck by the Two-view Initialization of First Step

Thanks for your insightful question. As introduced in our main text, the initialization process is specifically designed to guarantee the correct overall topological structure and is unrelated to reconstruction details. The majority of artifacts are primarily derived from inconsistencies in multi-view predictions rather than the reconstruction process. As shown in Figure 9, our method can yield satisfactory results even with a sphere as the initial shape. Furthermore, Figure 5(a) demonstrates that the ISOMER reconstruction step is capable of correcting some of the artifacts introduced by multi-view predictions. This showcases the resilience and corrective capabilities of our approach in refining the final reconstruction output.

Comment 3: Narrow Evaluation Scope, Limited Geometry Complexity

Thanks for your insightful question. For a consideration of fairness, the samples we have chosen for qualitative evaluation are largely based on previous methods, thus not incorporating diverse real-world objects. However, in our quantitative evaluation using the GSO dataset, a set of real-world objects scanned by Google, our method significantly outperforms existing methods. Our approach excels in capturing fine details, showcasing our proficiency in reconstructing complex geometries with high fidelity.

Regarding geometry complexity, our method is comparable to existing works such as Wonder3D, OpenLRM, and InstantMesh. It's important to note that, like these methods, we cannot generate unseen features such as "a solid bottom and an empty interior of a mug" when these aspects are not present in the input views. This limitation is inherent to the field and represents a challenge for all methods dealing with incomplete view data.

Comment 4: Pose Sensitivity and Distortion Issues

Thanks for pointing out the issue. To resolve your concerns, we add a new test with rotated objects in Table 2 to test robustness in non-front-facing views. The test results show that our method still performs well in this case, and even the geometry prediction is more accurate.

Comment 5: Oversimplified Texture Handling of Occluded Region

We appreciate your detailed comments. We agree that our current method for handling colors in non-visible areas is quite straightforward. The simplicity of this approach allows for extremely efficient processing times, ensuring the overall efficiency and stability of our workflow. We are committed to exploring more sophisticated coloring techniques in future work to enhance the visual output of our reconstructions.

Comment 6: Small Evaluation Set

Thanks for your thorough comment. We chose 30 random objects in alignment with the standards set by previous work (CRM, SyncDreamer, Wonder3D). Given that SyncDreamer takes up to 50 hours to generate 30 objects and results from other papers show that SyncDreamer performs significantly worse than existing methods, we exclude SyncDreamer and conduct an experiment on the entire GSO dataset. The results for all methods are provided in Table 1.

Comment 7: Lack of Quantitative Ablation Studies

We are grateful for your comprehensive comments. Recognizing the need for a more robust ablation study, we have taken the following steps to enhance our original analysis:

(a) ISOMER Module Analysis: We will incorporate a comparative experiment of the ISOMER module to better demonstrate its superiority over existing reconstruction algorithms.

(b) Explicit Target Algorithm Analysis: We will show qualitatively how the ExplicitTarget algorithm improves the final performance.

(c) Robustness Analysis: We will quantitatively study the performance and differences of our method under non-frontal inputs.

Comment

Thanks for the detailed rebuttal!

Comment 1/2: If the error mainly originated from multi-view inconsistency, why are there more seams and artifacts like dual ears at the sides instead of on the front view?

Comment 3: It was stated that the unseen features pose a challenge for all methods. However, I don't fully agree with this. For example, I think methods like CLAY (Rodin Gen-1) or One-2-3-45++ can address this issue effectively with 3D supervision, so such challenges wouldn't arise in these cases.

Comment 4: Are the input views in the test rendered at elevation=0? Are rotations only horizontal? Currently, this method shows noticeable distortion given non-zero elevation input views, which doesn't occur in works like InstantMesh.

Comment

Comment 1/2:

Thank you for your insightful observation regarding the presence of seams and artifacts. As mentioned, our implementation was inspired by Wonder3D's code: we assign different weights to different views during the ExplicitTarget calculation (2.0 for the front, 1.0 for the back, and 0.8 for the other sides). This weighting strategy helps mitigate artifacts in the front view, aligning with human preferences by maintaining higher quality in the most visible areas. We believe this approach enhances the overall visual quality, especially when the multi-view images are inconsistent.

Comment 3:

Your perspective on the handling of unseen features by methods like CLAY (Rodin Gen-1) is insightful. Indeed, CLAY's explicit 3D representation and explicit 3D supervision effectively address the issue of unseen features, producing robust structures in invisible areas. While CLAY may not ensure high consistency with the input views, it excels in generating detailed unseen regions. Integrating such strengths into our future work is a promising direction for us.

Regarding One-2-3-45++, its non-open-source nature prevents direct comparison. However, Instant-Mesh, which is strikingly similar to One-2-3-45++ in that it also employs Zero123++ for six-view generation and subsequent SDF generation with multi-view differentiable-rendering supervision, faces challenges such as generating "a solid bottom and an empty interior of a mug". For instance, Instant-Mesh fails to produce an empty interior of a mug because the interior cannot be observed from the surrounding views. This limitation suggests that One-2-3-45++ might also struggle with similar issues, as it does not inherently circumvent the problem of unobserved regions.

CLAY is excellent work that cannot be ignored. However, it is important to note that CLAY was published on arXiv a week after our submission deadline, hence it was not included in our current discussion. Our work focuses more on the geometric details and external visual appearance of objects, in contrast with CLAY's emphasis on overall structural generation. We will incorporate a discussion of these related works into the revised version of our paper.

Comment 4:

We acknowledge your concern about the potential distortion observed with non-zero-elevation input views. In the rebuttal appendix, the additional tests sample object rotations with azimuth $\in U[-180°, 180°]$ and elevation $\in U[-30°, 30°]$, rather than elevation $= 0°$. We have examined the test cases and did not find significant distortion for non-zero-elevation input views. We will supplement the multi-view visualization results under these conditions in the revised version. It is worth noting that if random rotations are not included during model training, the trained model does exhibit noticeable distortion for non-zero-elevation input views. Thank you for pointing this out!

Thank you very much for your questions and suggestions, and we welcome further discussion!

Comment

Comment 4: Pose Sensitivity and Distortion Issues: "For instance, a machine example in the supplementary video shows clear distortion when the input image is taken from an elevated side angle."

We have thoroughly reviewed the "machine example" you mentioned and have not found any significant distortion in the generated results. The input view of this example has an azimuth of approximately 60 degrees and an elevation of about 15 degrees, and the generated model aligns precisely with these angles, matching the rotation of the input object. Furthermore, the generated geometric structure is entirely consistent with the input image. On the contrary, this example actually demonstrates the robustness of our model in handling rotated objects.

Comment

Question 1:

We greatly appreciate your insightful feedback. In light of your suggestions, we will revise the introduction in the updated paper to clearly delineate our approach. Similar to 3D generation methods like Wonder3D and InstantMesh, our work draws inspiration from ImageDream and Zero123++ for multiview image generation. However, we employ two denoising models and a super-resolution model to accomplish multi-view generation with notable distinctions as follows:

(a) Orthogonal View Generation: Our first denoising model, similar to ImageDream, generates four orthogonal views instead of perspective views, which simplifies the subsequent reconstruction process and enhances multi-view consistency. The choice of orthogonal views facilitates a direct geometric correlation between pixels. Moreover, we integrate a learnable class embedding to encode view information, enhancing the model's ability to understand and process multi-view inputs.

(b) IP-Adapter Utilization: Rather than being used in the Multi-view Image Generation step, the IP-Adapter in our method is employed in the Multi-view Image Upscale step. We incorporate a ControlNet model with IP-Adapter to enhance multi-view details and achieve the target resolution.

(c) Normal Prediction Setting: In the Normal Prediction module, we adopt a denoising model with a reference U-Net, akin to the concept in Zero123++. However, unlike Zero123++, which shares all weights between the reference U-Net and the main network, we utilize an independent pre-trained reference U-Net. We freeze all parts of the reference U-Net except self-attention (Lines 544-545) to preserve the generalization ability of the pre-trained model.

Question 2:

Thanks a lot for pointing this out! Here $d(i,j)$ denotes the value at coordinate $(i,j)$ in the depth map, the integration runs along the vertical line $y=j$, $\vec{n}(x)$ denotes the vector of the input normal field at position $x$, and $n_x$ is the component of the normal field $\vec{n}$ along the direction of the x-axis (a scalar function).

The confusion might arise because the letter $x$ is used in two different contexts. To clarify, we will use the formula $d(i,j)=\sum_{t=0}^{i} n_x(t,j)$, where $d$ and $n_x$ are treated as 2D arrays in the discrete version. This change aligns better with how the variables are handled in the code.

Question 3:

Thanks for pointing out the issue. We will correct these mistakes in the revision.

Question 4:

We thank the reviewer for raising this question, and we regret that the lack of detailed comments on the appendix's algorithm caused confusion. The variable cnt is crucial: it determines the number of while-loop iterations at Line 11 (L11). It records the number of iterations needed to reach Line 21 (i.e., until all colors are applied), after which the iteration repeats cnt more times to guarantee the color-completion process finishes. Besides, we acknowledge the mistake at Line 3 (L3): the initialization of 'colored' should indeed be based on Inv. In the invisible-vertex coloring, a flood-fill operation is applied to the array 'colored' to fill the colors, while ongoing Laplacian smoothing is performed on the array C[i] to achieve a smoother color transition and a more aesthetically pleasing result. We appreciate your correction and will ensure future versions have clearer documentation to prevent misunderstandings.
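A simplified sketch of the described flood fill (the names and the smoothing pass are our approximation of Algorithm 1, not the actual code):

```python
import numpy as np

def flood_fill_colors(colors, visible, neighbors):
    """Propagate colors from visible vertices to invisible ones.

    colors:    (V, 3) vertex colors; only rows with visible=True are valid
    visible:   (V,) bool, True where a vertex received a color from some view
    neighbors: list of V lists of adjacent vertex indices
    """
    colored = visible.copy()
    cnt = 0                                  # passes needed until all vertices colored
    while not colored.all():
        cnt += 1
        newly = []
        for v in np.flatnonzero(~colored):
            nbr = [u for u in neighbors[v] if colored[u]]
            if nbr:                          # average over already-colored neighbors
                colors[v] = colors[nbr].mean(axis=0)
                newly.append(v)
        if not newly:                        # disconnected leftovers: stop
            break
        colored[newly] = True
    for _ in range(cnt):                     # repeat ~cnt smoothing passes on the fill
        for v in np.flatnonzero(~visible):
            if neighbors[v]:
                colors[v] = colors[neighbors[v]].mean(axis=0)
    return colors
```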

Question 5:

The "Expansion" in our model serves as a regularization technique directly applied to the parameters, similar to weight decay, rather than functioning as a loss term. At each step, vertices are moved a small distance in the direction of their normals.

Question 6:

Thanks for pointing out the issue. We will modify this statement in our revision following your suggestion.

Question 7:

Thanks for your thorough comment. The entire ISOMER process takes approximately 10 seconds. Within this timeframe, the Mesh Initialization step accounts for about 2 seconds, the preliminary Mesh Reconstruction takes around 3 seconds, the Mesh Refinement step consumes approximately another 5 seconds, and the final Mesh Colorization is completed in less than 0.1 seconds.

We appreciate the reviewer for pointing out those issues. We will thoroughly revise our paper. Thanks again for your time and in-depth suggestions.

Comment

Dear Reviewer GBd8:

Thanks for reviewing this work. Would you mind checking the authors' feedback to see whether it resolves your concerns, or do you have further comments?

Best, AC

Comment

Dear Reviewer,

Thank you for your patience and thorough review. We have addressed the issues you raised with detailed responses. We welcome you to engage in further discussion with us. Your insights are invaluable, and we are eager to clarify any points or provide additional information as needed.

Looking forward to your feedback and any further questions you may have.

Best regards, Authors

Review
6

This paper proposes a novel method for converting a single image to 3D. The method mainly consists of two stages: multi-view RGB and normal generation, and multi-view guided mesh optimization and texturing. The key innovation of the paper is a multi-view-normal-based 3D mesh reconstruction module. Specifically, given a single input image, the method first generates multi-view RGB images by fine-tuning a 2D diffusion model. It then employs a ControlNet-tile model and a super-resolution model to increase the resolution of the multi-view images from 256 to 2048. Additionally, a multi-view normal model is fine-tuned to generate the corresponding multi-view normal maps. Following this, a module called ISOMER is proposed to first convert the multi-view normal maps into a 3D mesh by fusing the normal maps to depth and then applying Poisson surface reconstruction. The mesh geometry is then refined according to a specialized multi-view normal loss. Finally, the mesh vertex color is optimized through a similar approach.

Strengths

  1. The proposed method is well-motivated. Previous methods attempt to directly convert multi-view RGB images to 3D through a feed-forward model, which is harder to train and requires more computing resources. The proposed method leverages the multi-view normal information from a 2D diffusion model to directly optimize 3D geometry, avoiding expensive training.

  2. The proposed multi-view normal to 3D module is novel and interesting.

  3. The paper is generally well-written and easy to follow.

Weaknesses

  1. While I agree that the proposed method is technically sound and may be a nice supplement to the community, my major concern is that the experimental section is quite lightweight. For instance, the quantitative experiment is only conducted on 30 objects from the GSO dataset. Since the proposed method and the comparing baseline are not expensive to run compared to SDS-based methods, it is not acceptable to base quantitative results on only 30 objects. It's very easy to get biased results, and the conclusions are likely to change if we choose another set of 30 objects. As a result, I strongly urge the authors to follow the conventions of previous papers and rigorously compare the methods on the entire GSO dataset with careful alignment between prediction and ground truth before calculating metrics. Additionally, it is highly suggested to include evaluations on other datasets, especially some real-world object datasets, and user studies. Otherwise, I cannot support the acceptance of the paper without convincing justification.

  2. The high-level pipeline of "multi-view normal generation with 2D diffusion models and 3D reconstruction with normal-based optimization" is not first proposed in this paper. For example, Wonder3D shares a very similar high-level pipeline, which greatly limits the contribution or novelty of the paper. While the paper proposes a novel and interesting reconstruction module, ISOMER, it has not been carefully analyzed separately. For instance, I would like to know whether the performance gains come from better multi-view prediction or a better multi-view normal-to-3D reconstruction module. An interesting experiment to include would be a direct comparison between the reconstruction modules of Wonder3D and ISOMER given the same multi-view normal maps. Only after a detailed ablation study can readers choose a better multi-view normal generation module and a better reconstruction module.

  3. The multi-view generation part lacks significant novelty, mainly consisting of existing known techniques related to 2D diffusion models.

  4. The ablation study is quite lightweight, and many important experiments are missing:

    (a) The ISOMER module includes multiple stages. I suggest the authors include both the intermediate and final results of the module. A quantitative evaluation would also be beneficial.

    (b) The method utilizes multiple 2D diffusion models to upscale the multi-view RGB (and normal?) images. It would be interesting to see the multi-view results before and after upscaling (resolutions at 256, 512, and 2048) and their impact on the final 3D models.

    (c) For the "explicit target," only the geometry results are shown. What is the effect on texture?

  5. The paper claims that "generate tens of millions of faces within seconds" as an advantage. However, I don't believe so. First, it's not difficult for existing methods to generate dense meshes efficiently. Also, dense meshes themselves are not required by applications but rather detailed and sharp geometry. In fact, many downstream applications prefer meshes with more compact faces and cannot tolerate tens of millions of faces.

  6. Line 170 states, "Finally, the mesh is corrected after iteration through edge collapse, edge split, and edge flip to maintain a uniform face distribution and reasonable edge lengths." This introduction is too brief as it involves many operations but lacks the motivation and implementation details for each operation.

Questions

  1. When converting normal maps to depth maps, how do you handle discontinuity issues? For example, there may be occluded regions, leading to sudden jumps in the normal map.

  2. The method uses two models to upscale the resolution, first from 256 to 512 and then to 2048. Do we really need two models? Can we directly upscale from 256 to 1024? Or can we change their order?

  3. It seems that the mesh vertex colors are directly computed as a weighted sum of the projected colors. Am I correct? Will this cause any inconsistency issues or other artifacts?

  4. Line 133: "we adopt a channel-wise noise offset strategy" is not very clear to me. Could you provide more details?

  5. Line 160: "not yield a real normal field which is irrotational. To address this ..." is not clear to me. Could you explain this further?

  6. Line 181: What does Figure 3(b) refer to?

  7. Line 224: "By examining the epipolar lines corresponding to each horizontal ray, we identify 13k instances of illegitimate data." This is not very clear to me. Could you provide more details?

  8. Line 241: What does "second level of training" refer to?

Limitations

The authors briefly mention the limitations in the final section.

Author Response

We sincerely appreciate your constructive and thorough comments. Your main suggestions on the experimental setting and ablation studies have helped us refine our paper. To address your concerns, we present our point-by-point response below.

Comment 1: Narrow Evaluation Scope

Thanks for your thorough comment. GSO (Google Scanned Objects) is a set of real-world objects scanned by Google using specialized equipment for research. We chose 30 random objects in alignment with the standards set by previous work (CRM, SyncDreamer, Wonder3D). Given that SyncDreamer takes up to 50 hours to generate 30 objects and results from other papers show that SyncDreamer performs significantly worse than existing methods, we exclude SyncDreamer and conduct an experiment on the entire GSO dataset. The results for all methods are provided in Table 1. Additionally, following your suggestions, we will add more experiments on other real-world object datasets, such as MVImageNet, and provide a detailed evaluation in the revised paper.

For the user study, as shown in Appendix F, we rendered 360-degree videos of subject-driven 3D models and presented each volunteer with five samples of rendered videos from a random method. Volunteers rated the samples on four aspects: 3D consistency, subject fidelity, prompt fidelity, and overall quality on a scale of 1-10, with higher scores indicating better performance. We collected results from 30 volunteers, as shown in Table 2 in the Appendix. Our findings indicate that users significantly preferred our method across these aspects.

Comment 2: Insufficient Analysis of ISOMER

We appreciate your insightful comment. Addressing the concern about the novelty in the multi-view generation aspect, we will provide a unified response in the subsequent question. To resolve your concern regarding ISOMER, we will revise its introduction to emphasize the key insight of each component within ISOMER. The first step of ISOMER ensures topological consistency, while the second step efficiently reconstructs rough geometries, and the third step improves surface reconstruction accuracy by handling multi-view inconsistencies with ExplicitTarget.
Our experiments indicate that ISOMER is capable of enhancing consistency as demonstrated in Figure 5. We include a new comparative evaluation experiment that contrasts Wonder3D with and without ISOMER to provide a clearer understanding of the benefits in Table 1.

Comment 3: Lack of Novelty in Multi-view Generation

We thank the reviewer for raising this concern. The main focus of our paper is on addressing the limitations of current multi-view reconstruction approaches, exemplified by Wonder3D. Existing methods often suffer from low resolution, multi-view inconsistency, and low generation speed. Our methodology is specifically tailored to address each of these issues within the existing framework, rather than proposing an entirely new pipeline.

Here's a concise summary of our approach:

  • Novel Reconstruction Algorithm: For the first time in such a pipeline, we have introduced a mesh-based reconstruction algorithm, which yields significant improvements in speed and quality.

  • Quality Enhancement: We have developed techniques to improve the resolution of reconstructed models, providing higher fidelity and detail.

  • Consistency Improvement: Our approach includes mechanisms to reduce inconsistencies across different views, leading to more coherent and reliable 3D reconstructions.

Comment 4: Lightweight Ablation Studies

We are grateful for your comprehensive comments. Recognizing the need for a more robust ablation study, we have taken the following steps to enhance our original analysis:

(a) Quantitative Evaluation of the ISOMER Module: We have incorporated a detailed quantitative assessment of the ISOMER module to better demonstrate its contribution to overall performance.

(b) Resolution Impact Analysis: We have expanded our study to qualitative comparisons across various resolutions to see how different resolutions impact the final 3D reconstruction outcomes.

(c) Explicit Target's Effect on Texture: Mirroring the approach used for geometry, we will include additional experimental results that illustrate the impact of the Explicit Target method on texture quality in the revised manuscript. We show the difference in Figure 1.

Comment 5: Advantage of Generating Dense Meshes

We thank the reviewer for raising this concern. We believe that detailed and sharp geometry does require a sufficient number of faces. As illustrated in Figure 6 of the main text, the intricacies of text engraving cannot be achieved without enough faces. For scenarios that prefer more compact meshes, such as gaming or other graphics-intensive applications, various existing mesh decimation and retopology techniques can be applied to achieve the desired outcome. Since this process is quite engineering-oriented, we did not include it within the scope of our paper. It is worth noting that while simplifying a high-precision dense mesh to a more compact one is feasible, the inverse (enhancing a low-detail model to a high-precision one) is significantly more challenging. This consideration is a primary reason our paper emphasizes optimizing generation accuracy and face count, aiming to address the complexities involved in achieving high-fidelity geometric detail.

Comment 6: Lack of Detailed Explanations

Thanks for pointing out the issue. We will modify our statement in the revision.

Comment

We will add more details about edge collapse, edge split, and edge flip in the revision as follows.

Edge Collapse: This operation is used to avoid and heal defects in the mesh. It contracts an edge so that its two endpoints merge into a single vertex, removing the two triangles incident to that edge. This process can help to eliminate narrow triangles that might be causing issues in the mesh, such as those that are too thin to accurately represent the surface they approximate. Edge collapse can prevent the creation of topological artifacts and maintain the quality of the mesh.

Edge Split: This is the opposite of edge collapse. In an edge split, an edge that is longer than a specified maximum length is divided into two, creating a new vertex at the midpoint of the edge. This operation is used to refine the mesh, ensuring that the local edge length is kept close to the optimal length. It helps to maintain the quality of the mesh by avoiding edges that are too long, which could lead to an inaccurate representation of the surface.

Edge Flip: Edge flip is an operation that adjusts the connectivity of the mesh to improve its quality. It replaces the shared edge of two adjacent triangles with the edge connecting the two opposite vertices, effectively changing the triangulation of the mesh. This can help to keep the degree of the vertices close to their optimal value, which is typically six for internal vertices (or four for boundary vertices).

The goal of these operations is to improve the mesh quality while avoiding defects and ensuring that the mesh accurately represents the target geometry.

Question 1:

Thanks for your questions. Edges with sudden jumps have steep normals, and integrating over these normals yields a large depth difference, so this does not have an observable negative impact on the algorithm.

Question 2:

Thanks for your insightful questions. The 256-to-512 model is specifically tuned to integrate information from multiple views, which is crucial for ensuring consistency across different perspectives. The 512-to-2048 model, on the other hand, is optimized to concentrate on the finer details of the reconstruction, enhancing the overall quality of the output. Since the multi-view-aware 256-to-512 model bears a relatively higher computational load than the 512-to-2048 model, this split is part of our strategic design to balance computational load with multi-view consistency and accuracy. This dual-model strategy improves efficiency in both training and inference, making our method more practical for real-world applications.

Question 3:

Yes. Since it is computed as a weighted sum, we rarely encounter inconsistency issues. However, it is indeed a promising direction to explore more advanced coloring methods in future work to further enhance the robustness and quality of our results.

Question 4:

Thanks for pointing this out; we will be more detailed in the revised version! For a noisy latent of shape [B, C, H, W], we additionally add a [1, C, 1, 1]-shaped N(0, 0.1) Gaussian noise to it, enhancing the generalization of the network and avoiding the zero terminal SNR problem.
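A minimal PyTorch sketch of this channel-wise offset, directly following the description above:

```python
import torch

def add_channel_noise_offset(noisy_latent: torch.Tensor, std: float = 0.1) -> torch.Tensor:
    """Add a [1, C, 1, 1]-shaped N(0, std^2) offset to a [B, C, H, W] noisy
    latent, shifting each channel by a shared random amount."""
    _, c, _, _ = noisy_latent.shape
    offset = std * torch.randn(1, c, 1, 1, device=noisy_latent.device,
                               dtype=noisy_latent.dtype)
    return noisy_latent + offset
```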

Question 5:

We thank the reviewer for raising this concern. For a legitimate normal field, any closed line integral should be zero, which is the meaning of being irrotational. However, predictions from neural networks cannot possibly meet this condition.

Question 6:

Thanks for pointing out the typo. We will correct it to Figure 5(b).

Question 7:

Thanks for your valuable comments. We will add more details as follows.

Because the four views lie in the same plane, there is an obvious pairwise epipolar relationship: any pixel in one view corresponds to a horizontal straight line in the neighboring view. Thus, if a non-null pixel has no non-null pixels on the corresponding straight line in the neighboring view, the data is illegitimate. This problem is usually caused by objects in the data that have no thickness, i.e., they are observable in one view and happen to be invisible in another.
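The check reduces to comparing row occupancy between adjacent views; a sketch follows (the function and variable names are ours, not the paper's):

```python
import numpy as np

def passes_row_epipolar_check(mask_a: np.ndarray, mask_b: np.ndarray) -> bool:
    """For orthographic views sharing the same horizontal plane, a foreground
    pixel at row y in view A must have at least one foreground pixel on row y
    of the adjacent view B; otherwise the sample is flagged as illegitimate.

    mask_a, mask_b: (H, W) boolean foreground masks of two adjacent views.
    """
    rows_a = mask_a.any(axis=1)                # rows occupied in view A
    rows_b = mask_b.any(axis=1)                # rows occupied in view B
    return bool(np.all(~rows_a | rows_b))      # every occupied row of A occupied in B
```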

Question 8:

It refers to Multi-view Image Upscale, with a detailed description in Appendix B.

We thank the reviewer for pointing out these typos. We will carefully proofread the manuscript and sincerely hope that you will find the revision satisfactory. We appreciate your time and insightful comments.

Comment

Thank you for the detailed response and additional experiments.

After reviewing your comments, I have the following questions and concerns:

  1. It appears that Wonder3D+ISOMER performs slightly better than Wonder3D alone but still falls short (a lot) of Unique3D. Does this imply that the primary improvement in Unique3D comes from the multi-view generation (2D diffusion models) rather than the reconstruction model (ISOMER)? Could you explain the significant performance gap between Wonder3D+ISOMER and Unique3D? Additionally, would combining Unique3D's multi-view prediction with Wonder3D's reconstruction model result in better performance than "Wonder3D+ISOMER"? I'm asking because I want to understand whether the improvement is due to the reconstruction method or the multi-view prediction. Based on the current results, it's difficult to determine.

  2. The authors claim that their multi-view module offers higher resolution, better multi-view consistency, and faster generation speed. However, the last two points are not supported by experiments. How do you quantitatively measure multi-view inconsistency?

  3. When directly calculating the point color as a weighted sum of projected 2D pixels, why doesn't this method suffer from inconsistency, especially at the boundaries or overlaps between multiple views?

  4. I still don't fully get the point. When there is a sudden depth change (e.g., due to occlusion), how can simply integrating the normals yield the correct depth? I don't think there would be steep normals; rather, there should be multiple segments of normals reflecting the surface properties of each region separately.

  5. Regarding the generation of dense meshes, I agree that generating sharp features requires a sufficient number of faces. However, the ability to export a large number of triangles doesn't necessarily indicate that the method can generate sharp and detailed geometry. For instance, existing methods can increase their resolution to 512 or 1024 when using the Marching Cubes algorithm, which will produce many more triangles, but the underlying geometry remains unchanged. My point is that you should only claim the generation of sharp details (with verification) as an advantage, not just the generation of a large number of triangles.

  6. Where is the "Quantitative Evaluation of the ISOMER Module"? I couldn't locate it in the rebuttal PDF.

  7. Regarding the "irrotational normal", what is the motivation behind "introducing a random rotation to the normal map before integration? The process is repeated several times, and the mean value of these integrations is then used to calculate the depth, providing a reliable estimation." How does this solve the problem?

  8. In the rebuttal, the qualitative examples provided are limited to one or two instances, which is not very convincing or helpful for understanding. Please avoid this and include more examples in your revision.

Comment

We sincerely appreciate your constructive and thorough questions. To address your concerns, we present our point-by-point response below.

  1. It appears that Wonder3D+ISOMER performs slightly better than Wonder3D alone but still falls short (a lot) of Unique3D. Does this imply that the primary improvement in Unique3D comes from the multi-view generation (2D diffusion models) rather than the reconstruction model (ISOMER)? Could you explain the significant performance gap between Wonder3D+ISOMER and Unique3D? Additionally, would combining Unique3D's multi-view prediction with Wonder3D's reconstruction model result in better performance than "Wonder3D+ISOMER"? I'm asking because I want to understand whether the improvement is due to the reconstruction method or the multi-view prediction. Based on the current results, it's difficult to determine.

Thank you for your question. We tested Unique3D multi-view + Wonder3D reconstruction, and the results are as follows:

| Method | PSNR↑ | SSIM↑ | LPIPS↓ | Clip-Sim↑ | Chamfer Dist.↓ | Vol. IoU↑ | F-Score↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Wonder3D | 18.0932 | 0.8995 | 0.1536 | 0.8535 | 0.0261 | 0.4663 | 0.6016 |
| Wonder3D+ISOMER | 18.6131 | 0.9026 | 0.1470 | 0.8621 | 0.0244 | 0.4743 | 0.6088 |
| Unique3D+Wonder3D | 19.1688 | 0.9219 | 0.1107 | 0.8732 | 0.0153 | 0.5232 | 0.6576 |
| Unique3D | 20.0611 | 0.9222 | 0.1070 | 0.8787 | 0.0143 | 0.5416 | 0.6696 |

We analyzed the results:

  1. Our multi-view training part differs from Wonder3D in data filtering, training strategies, and network architecture. For example, about 37% of the data in Wonder3D's data list cannot pass our data filtering (mainly due to issues like thicknessless surfaces, one-sided visibility, or too-small projected areas), which may have led to different learned preferences.
  2. We also observed that the main reason for Wonder3D's low scores is overly flat or thick predictions (often problematic with elongated objects like shoes or flat objects like books). Such errors are less common in our multi-view predictions. Based on the above analysis, your reading is correct: the main improvement in the quantitative comparisons comes from the accuracy of multi-view prediction. These metrics are not sensitive to geometric details and focus more on evaluating basic geometric structure, so ISOMER's improvement under them is relatively limited; however, we observe that the visual quality of Wonder3D+ISOMER is much better than Wonder3D's. We are also actively looking for a better metric to evaluate geometric details.

  2. The authors claim that their multi-view module offers higher resolution, better multi-view consistency, and faster generation speed. However, the last two points are not supported by experiments. How do you quantitatively measure multi-view inconsistency?

Thank you for your insightful question. The faster generation speed refers to the comparison between generating multi-views directly at 512 resolution and generating at 256 resolution and then upscaling to 512. Theoretically, the computational load of the latter is reduced by 55% compared to the former (since the computational load at 512 is four times that at 256). Regarding multi-view consistency, the comparison is against directly applying super-resolution from 256 to 2048: since direct super-resolution does not incorporate information from other views, it introduces more inconsistencies. We will clarify this in the revised version to avoid any ambiguity.

  3. When directly calculating the point color as a weighted sum of projected 2D pixels, why doesn't this method suffer from inconsistency, especially at the boundaries or overlaps between multiple views?

This is why our reconstruction method includes both a Reconstruction Stage and a Refine Stage. The Reconstruction Stage, which does not use ExplicitTarget, quickly produces a model that approximates the correct shape but is limited by multi-view inconsistency. ExplicitTarget then addresses this issue in the Refine Stage. An intuitive illustration is Fig. 1 of the rebuttal material, where ExplicitTarget is used for direct coloring; if the colors are replaced with multi-view normal maps, the same construction becomes the optimization target for each step of the Refine Stage. Without ExplicitTarget, the optimization would face inconsistencies across views, whereas ExplicitTarget avoids this problem. At the contour edges of each view, the weights used in the ExplicitTarget computation are based on the angle between the vertex normal and the view direction, so the weights at these edges are typically low. Additionally, since adjacent views are orthogonal in the four-view setup, a vertex's color is mostly influenced by a single view, preventing incorrect superposition.

Comment
  1. I still don't fully get the point. When there is a sudden depth change (e.g., due to occlusion), how can simply integrating the normals yield the correct depth? I don't think there would be steep normals; rather, there should be multiple segments of normals reflecting the surface properties of each region separately.

The training data includes steep edge normals (the multi-view normals in the data are derived from depth maps, since direct surface normals would yield incorrect data due to many surfaces in Objaverse having wrong orientations). Therefore, we do observe such normals during generation, although they may not be accurate (as shown in Appendix Fig. 8). However, the accuracy of the initial depth estimation is not crucial for the final result (as demonstrated in Appendix Fig. 9); what matters is capturing the topological holes, not metric accuracy, and the holes are reflected in the accuracy of the normal maps. We plan to explore more accurate initialization methods to further improve the final accuracy in the future.

  2. Regarding the generation of dense meshes, I agree that generating sharp features requires a sufficient number of faces. However, the ability to export a large number of triangles doesn't necessarily indicate that the method can generate sharp and detailed geometry. For instance, existing methods can increase their resolution to 512 or 1024 when using the Marching Cubes algorithm, which will produce many more triangles, but the underlying geometry remains unchanged. My point is that you should only claim the generation of sharp details (with verification) as an advantage, not just the generation of a large number of triangles.

We fully agree with your statement. A large number of triangles is merely a prerequisite for sharp details, not evidence of them. As shown in Fig. 3 of the main text, existing methods fail to achieve good sharp geometric details, largely because they extract meshes with methods like Marching Cubes; for example, replacing the reconstruction in Wonder3D with ISOMER improves the results. Methods like OpenLRM, CRM, and Instant-Mesh use Marching Cubes or FlexiCubes at resolutions no higher than $384$, since at $512$ resolution these algorithms must evaluate $512^3$ SDF values, which needs over 24GB of GPU memory and nearly a few minutes of runtime; a $1024$ resolution Marching Cubes would theoretically take over ten minutes and >100GB of GPU memory. In contrast, ISOMER achieves sharp detail reconstruction in just a few seconds.
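As a rough illustration of why resolution is capped, the number of SDF queries grows cubically with grid resolution; a small Python sketch is below (only the query counts follow directly from the grid size, while the quoted memory and runtime figures additionally depend on the SDF network, batching, and precision):

```python
# Cubic growth of SDF grid evaluations for Marching Cubes / FlexiCubes.
# Only the query counts below are exact; memory/runtime depend on the model.
for r in (256, 384, 512, 1024):
    print(f"{r}^3 grid -> {r**3 / 1e6:.0f}M SDF evaluations")
# 256^3 -> 17M, 384^3 -> 57M, 512^3 -> 134M, 1024^3 -> 1074M
```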

  3. Where is the "Quantitative Evaluation of the ISOMER Module"? I couldn't locate it in the rebuttal PDF.

The "Quantitative Evaluation of the ISOMER Module" refers to the indirect quantitative comparison between Wonder3D and Wonder3D + ISOMER. We will add a direct comparison based on ground-truth multiview normals in the revised paper. Compared to the ground-truth multiview normals, we believe the input for the reconstruction operation should be the generated multiview from the front view rather than the accurate ground-truth multiview. Therefore, we consider comparing using multi-view normal generated Wonder3D, which has inconsistencies across multiple views, to be more relevant to the actual task of this reconstruction algorithm.

  1. Regarding the "irrotational normal", what is the motivation behind "introducing a random rotation to the normal map before integration? The process is repeated several times, and the mean value of these integrations is then used to calculate the depth, providing a reliable estimation." How does this solve the problem?

"Irrotational normal" implies that the integral values along different paths should be the same, but since the predicted "normal map" is not "irrotational," these integral values actually differ. We choose straight lines as the integration path because it is the most straightforward to operate. Different straight lines yield different integral values, so we take their expectation as the final integral value. This is why random rotations are needed to calculate the mean value. We plan to explore using a direct depth prediction model to accomplish this task instead of calculating depth from normals in the future.

  5. In the rebuttal, the qualitative examples provided are limited to one or two instances, which is not very convincing or helpful for understanding. Please avoid this and include more examples in your revision.

Thank you for the suggestion! We agree with you; due to the limited rebuttal period and the addition of extensive experiments and code, some experiments could not be fully completed. We will include at least six representative examples for each qualitative experiment in the revision to strengthen the paper.

Thank you very much for your insightful questions and suggestions, and we welcome further discussion!

Review
6

This paper introduces Unique3D, a framework aiming to generate 3D meshes from single-view images with high quality and fidelity. Driven by the observation that higher-resolution 2D image and normal priors are crucial for generating intricate textures and complex geometries, Unique3D integrates a multi-level upscaling process to progressively improve resolution. It also proposes an instant and consistent mesh reconstruction algorithm called ISOMER to lower the computational complexity and improve reconstruction quality.

Strengths

Unique3D addresses the limitations of previous methods by generating high-resolution 2D images and normal maps and optimizing the colored mesh according to these 2D guidances. The idea of increasing resolution in 2D is straightforward and powerful, as the experiments suggest. The pipeline is efficient and can generate a mesh within a short time frame (about 30 seconds per mesh), which makes it practical for many real-world applications. The paper provides solid and detailed evidence for the effectiveness of its methods and delivers rather convincing results. The proposed ISOMER is novel and could inspire further studies in colored mesh reconstruction of objects. The paper includes a thorough analysis of the ISOMER algorithm, which provides good insights into the module.

Weaknesses

  1. This pipeline features an image diffusion model, two super-resolution modules, and a normal diffusion model, all working sequentially without interaction. This may introduce significant compounding errors in multi-view/multi-resolution consistency before the mesh is reconstructed with ISOMER. The authors should include a more thorough analysis of error patterns, with a particular focus on multi-view/multi-resolution consistency.
  2. The subsection "ExplicitTarget Optimization for Multi-view Inconsistency and Geometric Refinement" (L183-) is hard to read and understand. Improvements in writing and equations are needed.
  3. Based on my understanding of the ISOMER algorithm, it could be prone to inconsistencies, especially around the 2D boundaries of the generated multi-view images. Could you provide more insights into this aspect? Also, how well is the model working with non-front-facing input views? Adding some discussion for these could further improve the coverage of the experiments and help people understand the limitations.
  4. 30 random objects (L257) are too few to be a meaningful quantitative evaluation.

Questions

See weakness 1, 3, 4.

Limitations

Limitations are discussed very briefly in the conclusion section. Addressing the questions in the above weaknesses could involve further discussions on the limitations. Societal impacts are included in the supplementary materials.

Author Response

We sincerely thank you for your time and valuable comments, some of which are fundamental and in-depth suggestions that help us greatly improve our paper. To address your concerns, we present our point-by-point response below.

Comment 1: Multi-view Consistency Analysis

We appreciate the reviewer bringing this up. First, our choice of a sequential design is based on the following observations: 1) Integrating the image diffusion model and the two super-resolution models into a single 2048-resolution diffusion model would incur an enormous computational load due to the high resolution. 2) We noticed that methods combining image and normal diffusion in one model (e.g., Wonder3D) frequently yield normal maps lacking sufficient detail, and the guidance scale does not effectively control the output; our experiments with this hybrid architecture were consistent with these observations. By decoupling the two models, we notably enhance the accuracy of the normals, as the normal diffusion model conditions on clean images rather than noisy image latents. Therefore, we chose separate image and normal diffusion models to improve the fidelity of the normals and the subsequent mesh reconstruction. ISOMER can even be used to improve the consistency of other methods: in Table 1, replacing Wonder3D's reconstruction with ISOMER is not only faster but also higher quality.

Comment 2: Clarity in "ExplicitTarget Optimization" Subsection

We value your insightful suggestions. To address your concern, we will revise the explanation and equations in "ExplicitTarget Optimization" subsection for better comprehension. Here is our improved version:

(ExplicitTarget). Let $Avg(V, W) = \frac{\sum_i V_i W_i}{\sum_i W_i}$ represent the weighted average function, and let $V_{M}(v, i): (\mathbb{N}^+, \mathbb{N}^+) \rightarrow \{0, 1\}$ represent the visibility of vertex $v$ in mesh $M$ under view $i$. $Col_M(v, i)$ denotes the color of vertex $v$ observed from view $i$. We compute the ExplicitTarget $ET$ of each vertex in mesh $M$ as

$$ET_{M}(v) = \begin{cases} Avg\left(Col_M(v,i),\ V_{M}(v, i)\, W_M(v, i)^2\right) & \text{if } \sum_{i} V_{M}(v, i) > 0 \\ \mathbf{0} & \text{otherwise,} \end{cases}$$

where $W_M(v, i) = -\cos(N_{v}^{(M)}, N_{i}^{(view)})$ is a weighting factor, $N_{v}^{(M)}$ is the vertex normal of $v$ in mesh $M$, and $N_{i}^{(view)}$ is the view direction of view $i$.

$ET_M$ weights each view by its alignment with the vertex normal, so that a view does not introduce significant errors on surfaces that are heavily skewed with respect to it (predictions in such regions are usually inaccurate).
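To make the computation concrete, here is a minimal NumPy sketch of this weighted average. The array names, shapes, and the assumption that per-view vertex colors have already been sampled by projection are ours for illustration, not the authors' implementation:

```python
import numpy as np

def explicit_target(vertex_normals, view_dirs, colors, visibility):
    """Hypothetical sketch of the ExplicitTarget weighted average.

    vertex_normals: (V, 3) unit vertex normals of mesh M
    view_dirs:      (N, 3) unit view directions, one per view i
    colors:         (N, V, 3) vertex colors sampled from each view
    visibility:     (N, V) binary visibility V_M(v, i)
    """
    # W_M(v, i) = -cos(vertex normal, view direction); clip back-facing to 0.
    W = np.clip(-(vertex_normals @ view_dirs.T), 0.0, None).T   # (N, V)
    weights = visibility * W**2            # V_M(v, i) * W_M(v, i)^2
    denom = weights.sum(axis=0)            # sum of weights per vertex
    num = (weights[..., None] * colors).sum(axis=0)             # (V, 3)
    target = np.zeros_like(num)            # ET = 0 for never-visible vertices
    seen = denom > 0
    target[seen] = num[seen] / denom[seen, None]
    return target                          # (V, 3) per-vertex target color
```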

Comment 3: ISOMER Boundary Inconsistencies and Non-front-facing Views

Thanks for your thorough comment. We agree that the key insight of ISOMER could be better clarified. Following your suggestions, we will revise our paper to emphasize that ISOMER directly handles the case where the global normal of the same vertex is inconsistent across viewpoints.

As demonstrated in our ablation study (Figure 5), previous reconstruction algorithms often yield poor results with blurriness or wavy patterns when faced with inconsistent inputs, while adopting ISOMER notably improves the reconstruction. We include a new evaluation in Table 1 comparing Wonder3D with and without ISOMER to provide a clearer picture of the benefits.

Additionally, we added a new test with rotated objects in Table 2 to assess robustness to non-front-facing input views. The results show that Unique3D still performs well in this case, and the geometry prediction is even more accurate (because the overall shape of an object, e.g., a book, is better estimated from a rotated view).

Comment 4: Insufficient Quantitative Evaluation

We chose 30 random objects following the protocol of previous work (CRM, SyncDreamer, Wonder3D). Given that SyncDreamer takes up to 50 hours to generate 30 objects, and that results in other papers show it performs significantly worse than existing methods, we exclude SyncDreamer and conduct a thorough experiment on the full GSO dataset. During our experiments, we found that GRM's online samples were unavailable and its code was not open source, preventing further testing of GRM's results. The results for all methods are provided in Table 1.

Comment

Dear Reviewer Bw7b:

Thanks for reviewing this work. Would you mind checking the authors' feedback to see whether it resolves your concerns, or whether you have further comments?

Best, AC

Comment

Thanks for the response. I find most of my concerns addressed and will keep my score, leaning towards acceptance.

Comment

Dear Reviewer,

Thank you for your patience and thorough review. We have addressed the issues you raised with detailed responses. We welcome you to engage in further discussion with us. Your insights are invaluable, and we are eager to clarify any points or provide additional information as needed.

Looking forward to your feedback and any further questions you may have.

Best regards, Authors

Author Response

We thank the reviewers for their patience in reviewing. We respond to each reviewer individually below, with additional experiments provided in the Appendix PDF.

Final Decision

This paper was reviewed by four experts in the field. Although the initial reviews were mixed, after the rebuttal and discussion all reviewers converged to acceptance, with final ratings of 5, 6, 6, and 7. All reviewers agreed that this is high-quality work, with clear presentation, simple but effective ideas, and high-quality results. The AC also agrees that this is well-presented and important work and thus suggests acceptance.

Still, we recommend that the authors carefully read all reviewers' final feedback and revise the manuscript as suggested in the final camera-ready version, particularly regarding the ablation study and more diverse experiments. We congratulate the authors on the acceptance of their paper!