PaperHub

Rating: 6.8 / 10
Poster · 4 reviewers (min 6, max 7, std 0.4)
Individual ratings: 7, 7, 6, 7
Confidence: 3.8
Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.0
NeurIPS 2024

VCR-GauS: View Consistent Depth-Normal Regularizer for Gaussian Surface Reconstruction

OpenReview · PDF
Submitted: 2024-05-02 · Updated: 2024-11-06


Keywords: Gaussian Splatting, 3D Reconstruction, Surface Reconstruction

Reviews and Discussion

Official Review (Rating: 7)

The paper presents a method for improving geometry reconstruction for 3D Gaussian Splatting. It adopts a relatively flattened 3D Gaussian and incorporates normal regularization from monocular priors. Specifically, the paper proposes supervising the geometry by minimizing the differences between the normals derived from the depth map and the monocular normals. The method further refines the depth map by calculating the ray-Gaussian intersection and introduces a densification technique that splits large Gaussians on the surface into smaller ones, effectively reducing depth errors. To address multi-view inconsistencies arising from monocular priors, the paper proposes a weighting scheme to boost performance. Extensive results show that the method achieves competitive results on common benchmarks, including TNT, MipNeRF360, DTU, and Replica.

Strengths

  1. The paper presents a confidence map to resolve the multi-view inconsistencies arising from monocular priors estimated from pre-trained models.
  2. The paper identifies that large Gaussians cause significant depth errors and proposes a reasonable splitting scheme to address this issue.
  3. The paper is well-written, easy to follow, and the evaluation is comprehensive.

Weaknesses

  1. The contributions seem limited. The core component of the paper is the combination of depth normals and rendered normals for supervision with monocular priors. However, the depth normals and rendered normals appear similar to those in 2DGS, despite this paper using monocular priors to supervise between them. The intersected depth also shares a similar motivation with 2DGS and Gaussian Surfels [1]. However, the paper lacks acknowledgment of and references to these works; it should mention them or provide sufficient discussion.

  2. While the introduced monocular normals improve reconstruction quality, they also cause oversmoothing, as shown in the DTU results. Additionally, it remains unclear how the monocular priors work in more general cases. The paper should discuss these as limitations.

  3. The idea of using a monocular normal prior has already been adopted in [1]; the paper should cite or compare against it.

[1] High-quality Surface Reconstruction using Gaussian Surfels. In SIGGRAPH'24

Questions

  1. The paper states that using the rendered normal for supervision can only update the rotation of the Gaussians. Why? Given that the rendering equation involves all the Gaussians and they are all differentiable, if we use a multi-view normal map and train the Gaussians with the rendered normal map, is it really true that these Gaussians will not be optimized into a shape that fits the given normal maps?

  2. Why does setting the last scale to zero pose difficulties in optimization (L109)? Many papers have verified that setting the Gaussians to be 2-dimensional does not hurt performance [1][2]. I believe more justification is needed.

  3. Why is 2DGS faster than the proposed method in training yet has a lower FPS? The paper states the reason is that 2DGS applies a time-consuming ray-splat technique. However, according to the 2DGS paper, the ray-splat intersection does not seem that expensive, at least not more expensive than in this paper, since they also use intersected depth. Some analysis would be beneficial.

[1] High-quality Surface Reconstruction using Gaussian Surfels. In SIGGRAPH'24

[2] 3D-HGS: 3D Half-Gaussian Splatting. arXiv:2406.02720

Minor issues:

L175: "splitting" should be "splatting."

L198: I am also concerned about the use of "view consistent". In my opinion, the monocular normal priors are still multi-view inconsistent, although the proposed weighting scheme can reduce some negative effects.

Limitations

The weaknesses have been outlined previously, and I have a few questions that need clarification. I may raise my score if these questions are addressed properly.

Author Response

Response to Reviewer 5p3w (R#4)

Q1: Normal priors cause oversmoothing, as shown in the DTU results. Additionally, it remains unclear how the normal priors work for more general cases.
A: We agree that the monocular priors cause oversmoothing on the DTU results in Fig. 10 of our paper. The over-smoothness on the DTU dataset is due to the monocular normal estimator DSINE we used, which is mainly trained on scene-level data with only 160K images and therefore cannot produce details on the object-level DTU dataset. To obtain more detailed normal priors, we could train a larger DSINE with more parameters on a larger dataset, e.g. Omnidata with 12M images, Objaverse with over 10M objects, and MVImageNet with 6.5M frames.

We are not sure of the meaning of “more general cases” mentioned in the review. We reckon that this refers to the “generalization ability across different scenes”, and we provide our clarification based on this understanding. We have performed experiments on four datasets: TNT, DTU, ScanNet++, and Mip-NeRF 360, which cover large outdoor scenes, small objects, and large indoor scenes. We think these different types of scenes are sufficient to verify the effectiveness of our method.

Q2: The paper states that using the rendered normal for supervision can only update the rotation of the Gaussians. Why?
A: We want to clarify why the supervision of the rendered normal maps can only update the rotations of the Gaussians without directly affecting their positions. From the rendered normal equation shown below, we can see that: 1) Regularization on normal maps can affect or supervise the normals of Gaussians in 3D space. In this case, since the normal is only determined by the rotation and scale of a Gaussian, the position cannot be updated. 2) You are right that the positions of Gaussians are also updated (thanks for catching this; we will update the paper to clarify this is what we meant). A more accurate statement is that the supervision of the rendered normal maps can only effectively update the rotations without effectively affecting the positions for surface reconstruction. In fact, the positions are only updated through the alpha weights by affecting G(x) in the alpha-blending equation. As illustrated in Fig. 2 (a) and (b) of the PDF file in our rebuttal, this can only move a 3D Gaussian towards or further away from the intersecting ray through the blended alpha. In other words, the 3D Gaussians cannot move along the camera ray to be closer to the ground-truth surface. In contrast, as illustrated in Fig. 2 (c) of the PDF file in our rebuttal, our D-Normal regularizer can move a Gaussian in the direction parallel to the pseudo normal of the surface. To verify that the D-Normal regularizer actually behaves like this, we also visualize the optimized Gaussian centers of a scene in Fig. 1 of the PDF file in our rebuttal. We can see that the D-Normal regularizer significantly eliminates off-surface Gaussians.

$$
\hat{\mathbf{N}} = \frac{\sum_{i \in M} \mathbf{n}_i \, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)}{\sum_{i \in M} \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)}, \qquad \alpha_i = o_i \, G(\mathbf{x}_i)
$$
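For concreteness, here is a minimal sketch of this per-pixel alpha blending of normals (illustrative only; `n`, `o`, and `G` are assumed to be the per-Gaussian normals, opacities, and Gaussian values for the Gaussians hit by one ray, in front-to-back order):

```python
import torch

def render_normal(n, o, G):
    """Alpha-blend per-Gaussian normals along one ray, front to back.

    n: (M, 3) unit normals of the M Gaussians hit by the ray
    o: (M,)   opacities o_i
    G: (M,)   Gaussian values G(x_i) at the ray sample points
    """
    alpha = o * G
    # Transmittance T_i = prod_{j<i} (1 - alpha_j)
    T = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha[:-1]]), dim=0)
    w = alpha * T
    return (w[:, None] * n).sum(0) / w.sum()
```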

Q3: Why does setting the last scale to zero pose difficulties in optimization (L109)?
A: The reasons are: 1) Setting the last scale to zero is a poor initialization for the normals of the Gaussians, since all normals of the Gaussians in Gaussian Surfels [1] are initialized to (0, 0, 1). Also, setting the last scale to zero means that the Gaussians can only be rotated to achieve alignment with the surface during optimization. In contrast, our scale regularization is less constrained, as it offers the possibility of selecting the minimal scaling factor as the normal in addition to rotating the Gaussians. 2) At the beginning, nothing can be observed from a view that is perpendicular to the Gaussians initialized as surfels. As shown in Fig. 3 of the PDF file in our rebuttal, this leads to a local minimum with no optimization. This situation does not happen in our method because we gradually flatten the 3D Gaussians using scale regularization. The ablation in Tab. 5 H shows that an F1-score improvement can be achieved with our scale regularization instead of setting the last scale to zero.
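As a rough illustration of this gradual flattening (the exact regularizer in the paper may differ; this sketch simply penalizes the smallest of the three positive scale factors so Gaussians flatten over training instead of starting as surfels):

```python
import torch

def scale_flatten_loss(scales):
    """Push the smallest per-axis scale of each Gaussian toward zero so
    the Gaussians gradually become disk-like. `scales` is an (N, 3)
    tensor of positive scale factors."""
    return scales.min(dim=-1).values.mean()
```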

Q4: Why 2DGS is faster than the proposed method in the training but has a lower FPS.
A: The answer can be found in R#3Q5.

Q5: L175: "splitting" should be "splatting."
A: Thanks for pointing it out, we'll correct it in the final version.

Q6: L198: Concerned about the use of "view consistent".
A: We clarify that the “view-consistent” in L198 does not refer to the monocular normal priors used to supervise the D-Normals. The monocular normal priors are not changed, and they are still inconsistent across views. We mean that the addition of the confidence term in Eq. (12) down-weights the terms that have high multi-view inconsistency, and therefore the remaining terms in the loss are “view-consistent”. With the D-Normal regularizer but without the confidence term (Tab. 4 B), the result drops by 0.04 F1-score. This demonstrates that the confidence can mitigate the inconsistency of the predicted normal maps. In our title, “view consistent” describes the proposed confidence term, which places more emphasis on the view-consistent terms in the loss in Eq. (13).
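As a hedged sketch of such a confidence term (the exact form in Eq. (12) may differ; `gamma` is a hypothetical sharpness parameter introduced here for illustration):

```python
import torch
import torch.nn.functional as F

def confidence_weight(rendered_n, prior_n, gamma=2.0):
    """Down-weight pixels where the monocular normal prior disagrees
    with the currently rendered normal, i.e. where the prior is likely
    inconsistent across views.

    rendered_n, prior_n: (H, W, 3) normal maps.
    Returns a per-pixel confidence in [0, 1].
    """
    cos = (F.normalize(rendered_n, dim=-1) *
           F.normalize(prior_n, dim=-1)).sum(-1)  # cosine similarity
    return cos.clamp(min=0.0) ** gamma

# Usage sketch: weight a per-pixel D-Normal L1 loss by the confidence.
# loss = (confidence_weight(N_hat, N_prior)
#         * (N_d - N_prior).abs().sum(-1)).mean()
```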

Q7: The contributions seem limited. The depth normals and rendered normals appear similar to those in 2DGS. The intersected depth shares a similar motivation with 2DGS and Gaussian Surfels. Using a monocular normal prior has been adopted in [1].
A: The answer can be found in the part of ‘To all Reviewers’ at the beginning.

Comment

Thank you for your response. While the idea of incorporating monocular normal priors and a better densification strategy has potential, the current submission has significant issues that make it difficult to accept:

  1. The major insight - supervising normals rendered from 3D Gaussians, which only updates the rotation parameter - is not well-justified. The response (plus the illustration figure in the PDF file) only considers a case optimizing normal from a single view, but the paper uses multiple-view normal maps for supervision. Therefore, the current explanation is not well adapted to multiview settings. Although the ablation study shows that using depth-normal is better than render_normal, I believe that this is because using depth-normal will smooth the depth map, and thus enhance TSDF fusion. This core insight needs stronger justification.

  2. The paper fails to adequately discuss related prior work, particularly [1], which has been referenced in the current submission. Several formulations and methods (e.g., D-normal, normalized expected depth, and meshing approach) are identical to those in [1], yet there is no proper citation or discussion. The response claims concurrent development, but [1] was published two months before the submission deadline, and the major tables in this paper are also from [1]. This raises concerns about the originality and independence of this work. The claim of important analysis not found in [1] is unconvincing, as the formulation remains the same, and [1] did not require monocular priors for their analysis.

  3. The use of monocular priors, though potentially beneficial, is not sufficiently motivated or insightful in the current manuscript. MonoSDF has successfully applied monocular normal priors to challenging scenarios like DTU (3 views), ScanNet, and TnT Advanced, whereas the current evaluation on DTU (48+ views), TnT, and MipNeRF360 does not sufficiently justify the need for monocular normal priors. Additionally, the argument that the normal consistency loss in [1] can lead to conflicting updates is undermined by the fact that the proposed method (F1 score: 0.37) still performs worse than GOF [2] (F1 score: 0.46), which used normal consistency without monocular priors and was published four weeks before the submission deadline.

  4. The response mentions omitting auxiliary renderings such as depth, normal, and semantic maps for efficiency. However, it is unclear whether this omission was also applied to [1], making the efficiency comparison questionable.

In summary, the manuscript is not well-positioned, and its major insights are flawed and lack proper justification. The adoption of monocular priors, while not novel, fails to demonstrate significant effectiveness compared to prior work. Therefore, I am leaning toward rejection. With substantial revisions and better positioning, the paper could potentially have a greater impact in the future.

[1] 2D Gaussian Splatting for Geometrically Accurate Radiance Fields. arXiv:2403.17888

[2] Gaussian Opacity Fields: Efficient and Compact Surface Reconstruction in Unbounded Scenes. arXiv:2404.10772

Comment

Response to Reviewer 5p3w (R#4)

Q1: The major insight - supervising normals rendered from 3D Gaussians, which only updates the rotation parameter - is not well-justified. The response (plus the illustration figure in the PDF file) only considers a case optimizing normal from a single view, but the paper uses multiple-view normal maps for supervision. Therefore, the current explanation is not well adapted to multiview settings.
A: The figures drawn in the single view are simplified to illustrate the optimization of positions of Gaussians under normal and d-normal supervisions. As we said in the rebuttal, supervision on rendered normal maps cannot effectively affect the positions of Gaussians, while our d-normal regularizer can. We show the mathematical proof below.

**Propositions:**

***Case 1:***
Supervision on rendered normal cannot effectively affect the positions of Gaussians

***Case 2:***
Supervision on our D-Normal regularizer can effectively affect the positions of Gaussians

**Proof:**

Without loss of generality, we omit the summation over multiple views in the following derivations for brevity.

Based on the loss $\mathcal{L}_{\text{n}}$ on the rendered normal (cf. Eq. 8 of our paper), the gradient of $\mathcal{L}_{\text{n}}$ with respect to the position $\mathbf{p}_i$ is:

$$
\begin{aligned}
\frac{\partial \mathcal{L}_{\text{n}}}{\partial \mathbf{p}_i} &= \frac{\partial \mathcal{L}_{\text{n}}}{\partial \hat{\mathbf{N}}} \cdot \frac{\partial \hat{\mathbf{N}}}{\partial \mathbf{p}_i} \\
\frac{\partial \hat{\mathbf{N}}}{\partial \mathbf{p}_i} &= \frac{\partial \hat{\mathbf{N}}}{\partial \alpha_i} \cdot \frac{\partial \alpha_i}{\partial \mathbf{p}_i} + \frac{\partial \hat{\mathbf{N}}}{\partial \mathbf{n}_i} \cdot \frac{\partial \mathbf{n}_i}{\partial \mathbf{p}_i} \\
&= \frac{\partial \hat{\mathbf{N}}}{\partial \alpha_i} \cdot \frac{\partial \alpha_i}{\partial G(\mathbf{x})} \cdot \frac{\partial G(\mathbf{x})}{\partial \mathbf{p}_i} \\
&= \frac{\partial \hat{\mathbf{N}}}{\partial \alpha_i} \cdot \frac{\partial \alpha_i}{\partial G(\mathbf{x})} \cdot \left[ -G(\mathbf{x}) \, (\mathbf{R} \mathbf{S} \mathbf{S}^\top \mathbf{R}^\top)^{-1} (\mathbf{x} - \mathbf{p}_i) \right] \\
&\approx \frac{\partial \hat{\mathbf{N}}}{\partial \alpha_i} \cdot \frac{\partial \alpha_i}{\partial G(\mathbf{x})} \cdot \left[ -G(\mathbf{x}) \, (\mathbf{x} - \mathbf{p}_i) \right] \propto (\mathbf{x} - \mathbf{p}_i)
\end{aligned} \tag{1}
$$

Based on the D-Normal regularization $\mathcal{L}_{\text{dn}}$ (cf. Eq. 11), the gradient of $\mathcal{L}_{\text{dn}}$ with respect to the position $\mathbf{p}_i$ is:


$$
\begin{aligned}
\frac{\partial \mathcal{L}_{\text{dn}}}{\partial \mathbf{p}_i} &= \frac{\partial \mathcal{L}_{\text{dn}}}{\partial \bar{\mathbf{N}}_d} \cdot \frac{\partial \bar{\mathbf{N}}_d}{\partial \hat{D}} \cdot \frac{\partial \hat{D}}{\partial \mathbf{p}_i}, \\
\frac{\partial \hat{D}}{\partial \mathbf{p}_i} &= \frac{\partial \hat{D}}{\partial \alpha_i} \cdot \frac{\partial \alpha_i}{\partial \mathbf{p}_i} + \frac{\partial \hat{D}}{\partial d_i} \cdot \frac{\partial d_i}{\partial \mathbf{p}_i} \\
&= \frac{\partial \hat{D}}{\partial \alpha_i} \cdot \frac{\partial \alpha_i}{\partial G(\mathbf{x})} \cdot \frac{\partial G(\mathbf{x})}{\partial \mathbf{p}_i} + \frac{\partial \hat{D}}{\partial d_i} \cdot r_z \cdot \frac{\mathbf{n}}{\mathbf{n} \cdot \mathbf{r}}.
\end{aligned} \tag{2}
$$

We can deduce the following from Eqs. (1) and (2):

**Case 1:**
From Eq. (1), we can see that the gradient update $\frac{\partial \hat{\mathbf{N}}}{\partial \mathbf{p}_i}$ of the position is independent of the normal $\mathbf{n}$. Consequently, the ***supervision on the rendered normal cannot effectively affect*** the Gaussian position $\mathbf{p}$.

**Case 2:**
From Eq. (2), there is an additional term $\frac{\mathbf{n}}{\mathbf{n} \cdot \mathbf{r}}$, where the denominator $\mathbf{n} \cdot \mathbf{r}$ is a scalar. This term effectively makes the change in the position $\frac{\partial \hat{D}}{\partial \mathbf{p}_i}$ move along the direction of the normal $\mathbf{n}$. Consequently, the ***supervision on the D-Normal directly affects*** the Gaussian position $\mathbf{p}$.

We can further deduce that the gradient-update on the Gaussian position ***pulls the position along the normal*** towards the surface, which achieves better reconstruction.

**(Q.E.D)**

In view of the above proof, we conclude that it is better to supervise through the D-Normal regularizer.
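As an illustrative numerical check of Case 1 (a toy one-pixel setup with isotropic covariances, so that $(\mathbf{R}\mathbf{S}\mathbf{S}^\top\mathbf{R}^\top)^{-1}(\mathbf{x} - \mathbf{p}_i) \parallel (\mathbf{x} - \mathbf{p}_i)$; all names and values are hypothetical):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# One ray sample point x and four Gaussians with centers P, opacities o,
# and fixed unit normals n; isotropic covariance sigma2 * I.
x = torch.tensor([0.0, 0.0, 1.0])
P = torch.randn(4, 3, requires_grad=True)
o = torch.rand(4) * 0.5 + 0.3
n = F.normalize(torch.randn(4, 3), dim=-1)
n_gt = torch.tensor([0.0, 0.0, 1.0])  # supervision normal
sigma2 = 0.5

G = torch.exp(-0.5 * ((x - P) ** 2).sum(-1) / sigma2)  # G(x)
alpha = o * G
T = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha[:-1]]), dim=0)
w = alpha * T
N_hat = (w[:, None] * n).sum(0) / w.sum()  # alpha-blended normal

loss = ((N_hat - n_gt) ** 2).sum()
loss.backward()

# Each position gradient should be (anti-)parallel to (x - p_i),
# regardless of the normals: |cos| ~= 1 for every Gaussian.
for i in range(4):
    cos = F.cosine_similarity(P.grad[i], x - P[i].detach(), dim=0)
    print(f"Gaussian {i}: |cos(grad, x - p)| = {cos.abs().item():.4f}")
```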

Comment

Q2: Although the ablation study shows that using depth-normal is better than render_normal, I believe that this is because using depth-normal will smooth the depth map, and thus enhance TSDF fusion. This core insight needs stronger justification.
A: We respectfully disagree that using depth-normal will smooth the depth map, and thus enhance TSDF fusion. From Fig. 1 of the PDF file in our rebuttal, we can see that the proposed D-Normal regularizer effectively pushes the 3D Gaussians towards the surface thus providing much cleaner reconstruction, while only rendered normal supervision cannot. We further provide the mathematical proof in our response to Q1.

Q3: The paper fails to adequately discuss related prior work, particularly [1], which has been referenced in the current submission. Several formulations and methods (e.g., D-normal, normalized expected depth, and meshing approach) are identical to those in [1], yet there is no proper citation or discussion.
A: We have cited 2DGS in our paper and have even done a comprehensive comparison with it on four datasets. We ***did not claim*** the expected depth and the meshing approach as our contributions. The meshing approach, TSDF fusion, is a common method in 3D reconstruction and was not first proposed by 2DGS. We'll make this clearer in our revision.

Q4: The response claims concurrent development, but [1] was published two months before the submission deadline, and the major tables in this paper are also from [1]. This raises concerns about the originality and independence of this work.
A:
**Concurrent Development**

We respectfully request the reviewer to refer to the arXiv version of [1]; it can be clearly seen that its first version was posted on:

v1: Tue, 26 Mar 2024 17:21:24 UTC (23,221 KB),

which is clearly within 2 months of the NeurIPS submission deadline on 22 May 2024. Furthermore, it usually takes a few days for a paper to become available after submission to arXiv.

It is stated in the “NeurIPS 2024 FAQ for Authors” page that:

**What is the policy on comparisons to recent work?** Papers appearing less than two months before the submission deadline are generally considered concurrent to NeurIPS submissions. Authors are ***not expected*** to compare to work that appeared only a month or two before the deadline.

[1] 2D Gaussian Splatting for Geometrically Accurate Radiance Fields. arXiv:2403.17888

**Originality and Independence**

We strongly disagree.

We ***do not claim*** that we are the first to derive the depth-normal formulation in our submission. We cited and mentioned in our paper that we are inspired by VNL (*cf.* [55] in our paper), which shows the depth-normal derivation for depth map prediction. We also mentioned in our submission that our contribution lies in our observation that ***supervising the D-Normal with a monocular normal prior*** leads to a complete update of the geometric parameters of the Gaussians.

We emphasized in our rebuttal (*cf.* To all reviewers) that [1] ***does not provide*** the insight and analysis that updating the D-Normal effectively updates all geometric parameters of the Gaussians. It is clear that [1] is ***not aware*** of this important finding, since they propose a ***weaker normal consistency loss*** (*cf.* Eq. 14 of [1]) that updates the splat's normal, which can lead to a performance drop. We show the ***mathematical proof*** in our response to Q1, and we have ***shown experimentally*** in Column A vs. B of the table in the response to R#3 that updating the splat's normal indeed causes a drop in performance.

Lastly, we reiterate our other important contributions (*cf.* To all reviewers) on our ***geometric-aware confidence term*** and ***densification and splitting*** procedures that lead to further improvements in our performance.

**Major tables in this paper are also from [1]**

Although we are not obligated to compare with [1], we did not ignore it. We still compare with and cite [1], and we show in our submission that our proposed method outperforms it.

Comment

Q5: The claim of important analysis not found in [1] is unconvincing, as the formulation remains the same, and [1] did not require normal priors for their analysis.
A: We respectfully disagree that ‘the claim of important analysis not found in [1] is unconvincing’. 2DGS merely claimed that ‘By aligning the splat normal with the estimated surface normal, we ensure that 2D splats locally approximate the actual object surface’, and proposed a weaker normal consistency loss. In contrast, we give an insightful and important analysis of how our D-Normal formulation leads to a complete and effective update of all the geometric parameters (rotation, scale, and position) of the Gaussians (*cf.* L184-185 of our paper), in comparison with the naive normal formulation, which effectively updates only the rotation and scale (*cf.* Eq. 5 of our paper). More details can be found in the ‘To all Reviewers’ part at the beginning of our rebuttal.

Q6: The use of monocular priors, though potentially beneficial, is not sufficiently insightful in the current manuscript. MonoSDF has successfully applied normal priors to challenging scenarios, whereas the current evaluation on DTU, TnT, and MipNeRF360 does not sufficiently justify the need for normal priors.
A: MonoSDF is a NeRF-based method, where supervision on normals can effectively affect the positions in 3D space (cf. L37 to 39 of our paper). Different from NeRF-based methods, the normals in Gaussian-based methods are not determined by the positions of the Gaussians. As a result, supervising the normals cannot effectively update the positions of the Gaussians for surface reconstruction. We thus propose to supervise the D-Normal with monocular normal priors to effectively update the full parameters of the Gaussians. The effectiveness of our D-Normal regularizer can be seen in Tab. 4 A: the performance drops by 0.10 F1-score (from 0.4 to 0.3) on TNT without the D-Normal regularizer. We further prove how this supervision is essential for surface reconstruction in our response to Q1.

Q7: The argument that the normal consistency loss in [1] can lead to conflicting updates is undermined by the fact that the proposed method (F1 score: 0.37) still performs worse than GOF [2] (F1 score: 0.46), which used normal consistency without monocular priors and was published four weeks before the submission deadline.
A: The argument that “GOF, which uses the normal consistency loss, shows better results than ours with the D-Normal regularizer, and therefore the normal consistency loss is better than ours” is illogical. The fair comparison is to use the same baseline with the normal consistency loss vs. our d-normal regularizer. As shown in Column B of the table in R#3Q2, replacing our d-normal with the normal consistency in our framework leads to a performance drop from 0.4 to 0.37. This supports our claim that our D-Normal is better than the normal consistency loss.

The main contribution of GOF is a new mesh extraction method, and it can be seen as ***complementary to our method***. Since our method optimizes Gaussians to be close to the surface while GOF extracts the mesh after the optimization, both methods can be used together for better reconstruction. Although GOF is ***marginally better*** than us (0.46 vs. 0.4), it is ***significantly less efficient*** due to its longer mesh extraction. We use TSDF fusion, which costs ~2 minutes, while GOF uses multiple Marching Tetrahedra passes, which cost ~50 minutes. The marginal improvement in the performance of GOF comes with a ***large tradeoff in computational time***.

Furthermore, GOF was posted on arXiv on:

v1: Tue, 16 Apr 2024 17:57:19 UTC,

which is less than 2 months from the NeurIPS submission deadline on 22 May 2024. We thus have no obligation to compare with them in our submission according to the NeurIPS guidelines (*cf.* our response to Q4).

[2] Gaussian Opacity Fields: Efficient and Compact Surface Reconstruction in Unbounded Scenes. arXiv:2404.10772

Comment

Q8: The response mentions omitting renderings such as depth, normal, and semantic maps. It is unclear whether this omission was also applied to [1], making the efficiency comparison questionable.
A: As we have shown in our rebuttal, 2DGS is slow mainly because of its rendering method: it computes the intersection point first and then uses it to get the Gaussian value for rendering. This is different from our rendering method, which remains the same as the original 3D Gaussian Splatting. Omitting auxiliary renderings only increases efficiency slightly. The use of semantic maps is proposed by us to trim the mesh, and it is not present in 2DGS. As shown in the table below, we provide a comprehensive comparison of the rendering speed on Mip-NeRF 360. From the table, we can see that even with depth (without intersection depth) and normal rendering, our method is still faster than 2DGS. This verifies that the rendering of our method, which follows the original 3D Gaussian Splatting, is faster than 2DGS.

| Method | 2DGS (Depth + Normal) | Ours (Depth + Normal) | Ours (Color only) | 3DGS (Color only) |
|---|---|---|---|---|
| FPS | 27 | 114 | 128 | 134 |
Comment

From Eq. (1), we can see that the gradient-update of position is independent of the normal. Consequently, the supervision on rendered normal cannot effectively affect the Gaussian position.

This claim is also not correct. The mistake in the provided proof is that the derivative $\partial N / \partial \alpha_i$ depends on the normal vector $\mathbf{n}$.

Let's assume the normal is rendered as $N = \sum_{i} w_i n_i$, where $w_i = \alpha_i \prod_{j<i}(1-\alpha_j)$ and $\alpha_i = o_i G_i(x)$.

We then have:

$$\frac{\partial N}{\partial w_i} = n_i + \sum_{j=i+1}^{N} n_j \frac{\partial w_j}{\partial w_i}$$

$$\frac{\partial w_k}{\partial w_i} = -\left(\alpha_{i+1} + \alpha_{i+2}(1-\alpha_{i+1}) + \dots + \alpha_k \prod_{j=i+1}^{k-1}(1-\alpha_j)\right) = -\sum_{j=i+1}^{k} T_{kj}\,\alpha_j$$

$$\frac{\partial N}{\partial w_i} = n_i - \sum_{j=i+1}^{N-1} n_j \left(\sum_{k=j+1}^{N-1} T_{Nk}\,\alpha_k\right)$$

$$\frac{\partial N}{\partial \alpha_i} = T_i\left(n_i - \sum_{j=i+1}^{N-1} n_j \left(\sum_{k=j+1}^{N-1} T_{Nk}\,\alpha_k\right)\right)$$

From this, we can observe that the gradient update is indeed dependent on the normal vector $\mathbf{n}$. I hope this correction will help the authors gain a deeper understanding of their method and refine their claims accordingly.

Comment

Thank you for your response and detailed explanation. However, I believe there may be some misunderstandings regarding my initial concerns.

First, while I acknowledge that using D-normal improves quantitative results, my primary concern is with the theoretical basis presented in the paper. The central claim that supervising normals rendered from 3D Gaussians only updates the rotation parameter is not entirely accurate. As I previously mentioned, do the authors genuinely believe that using a multi-view normal map to train the Gaussians would not lead to the Gaussians being optimized into a shape that better fits the provided normal maps? The current theoretical explanation does not adequately address this concern.

Second, I have no doubt about the effectiveness of using D-normal with monocular priors. However, I emphasize that the submission lacks sufficient motivation for why monocular priors should be used. If comparable results can be achieved with simple regularization, what is the justification for incorporating monocular priors? From my perspective, monocular priors seem less effective on common datasets under dense view settings, such as MipNeRF 360, DTU (dense), and TnT. I suggest the paper focus on more challenging datasets, like DTU (sparse), ScanNet, and TnT Advanced, where traditional SDF methods without priors typically struggle due to sparse views and large textureless regions. In these scenarios, the proposed method's effectiveness could be more pronounced. Additionally, comparisons should primarily be made with methods that also utilize depth or normal priors.

While I agree that the authors are not obligated to compare with concurrent works that appeared less than two months ago, as per NeurIPS guidelines, it would be beneficial to include a discussion on this work. Since the intersected depth, two normals, and the meshing approach are similar to those in [1], a detailed discussion of the differences is encouraged. The authors should emphasize the unique aspects of their method to ensure it is evaluated on its own merits. I also slightly disagree with the authors' claim that they are not required to compare with [1], considering that the baselines of Gaussian-based methods (e.g., TnT and DTU) are produced by [1] and these mesh generations are closely related to [1].

Lastly, while I acknowledge the "geometric-aware confidence term" as effective and interesting, the incorporation of densification and splitting seems incremental, as this is a known strategy. Additionally, the integration of normal priors and semantic masks to improve performance does not seem new to me. Therefore, the paper needs to demonstrate strong results to warrant acceptance, such as by applying the method to more challenging scenes and providing clear improvements over baselines that do not use monocular normals, particularly in highlighted or textureless scenes. However, in the current submission, I find it difficult to perceive significant improvements, given the over-smoothed results on DTU and comparable results on MipNeRF 360 and TnT when compared to [1] (the original paper, not the reproduction).

Based on the above evaluations, I believe that the disadvantages of the current submission outweigh its advantages, which is the primary reason for my recommendation to reject the paper.

Comment

Response to Reviewer 5p3w (R#4)

Q1. My primary concern is with the theoretical basis. Do the authors genuinely believe that using a multi-view normal map to train the Gaussians would not lead to the Gaussians being optimized into a shape that better fits the provided normal maps?
A: With our illustration (cf. Fig. 1 of the PDF file) and theoretical proof (cf. our response to Reviewer 5p3w (R#4) Q1), there is ***no reason for us to doubt our claim*** that training the Gaussians with multi-view normal maps alone cannot effectively optimize the Gaussian positions into a shape that better fits the provided normal maps. We reiterate that, as shown in our proof in our response to Reviewer 5p3w (R#4) Q1, the gradient of $\mathcal{L}_{\mathbf{n}}$ w.r.t. the position is not affected by the supervision on the rendered normal. This conclusion ***holds true in the multi-view setting***, since the loss is merely a summation over all views, which does not change the fact that the gradient w.r.t. the position is not affected by the supervision on the rendered normal.

Q2: If comparable results can be achieved with simple regularization, what is the justification for incorporating monocular priors? Monocular priors seem less effective on common datasets under dense view settings, such as MipNeRF 360, DTU (dense), and TnT. I suggest the paper focus on more challenging datasets, like DTU (sparse), ScanNet, and TnT Advance, where traditional SDF methods without priors typically struggle due to sparse views and large textureless regions. Additionally, comparisons should primarily be made with methods that also utilize depth or normal priors.

A:
**Dense View Setting**
We respectfully disagree with the reviewer’s comment that “monocular priors seem less effective on common datasets under dense view settings, such as TNT”. As shown in Tab. 1 of our paper on TNT, the recent SOTA methods without normal priors such as SuGaR (CVPR2024) and 2DGS (SIGGRAPH2024) show poor performance (0.19 and 0.3 F1-score respectively) while our method with monocular normal priors achieves significant improvement (0.4 F1-score).

**Textureless Scene**
As shown in Fig. 2 of our paper, we have done experiments on textureless scenes (Replica dataset) and achieved a significantly higher F1-score than 2DGS without monocular normal priors (78.17 vs 64.36).

**Sparse View Setting**
We follow the reviewer's request and conduct experiments on sparse versions of the TNT dataset with 80%, 60%, 40%, and 20% of the images. From the table below, we can see that even when trained on only 20% of the images, our method still outperforms 2DGS trained on all images (0.35 vs. 0.3), which verifies the effectiveness of the monocular normal prior.

| Percent | Full | 80% | 60% | 40% | 20% |
|---|---|---|---|---|---|
| 2DGS* | 0.3 | - | - | - | - |
| Ours | 0.4 | 0.39 | 0.38 | 0.36 | 0.35 |

*We omit the results of 2DGS on fewer than full views, since its full-view result is already lower than ours at 20% of the views.

Comment

Q3: Since the intersected depth, two normals, and the meshing approach are similar to those in [1], a detailed discussion of the differences is encouraged. I also slightly disagree with the authors' claim that they are not required to compare with [1], considering that the baselines of Gaussian-based methods (e.g., TnT and DTU) are produced by [1] and these mesh generations are closely related to [1].

We ***strongly disagree with the rejection of our paper due to a concurrent work*** [1] put on arXiv within two months of the NeurIPS submission deadline (as per NeurIPS' policy).

Furthermore, ***we have shown*** comparisons and highlighted our contributions over [1] in both our main paper (cf. Tab. 1, Tab. 3, Tab. 6) and rebuttal (cf. our responses “To All Reviewers” and “Reviewer 5p3w (R#4) Q4” under “Originality and Independence”). We summarize ***our contributions over [1]*** here again:

  1. We provide the ***insight and analysis*** that ***supervising the D-Normal*** effectively updates all geometric parameters of the Gaussians. It is clear that [1] is not aware of this important finding, since they propose a weaker normal consistency loss (*cf.* Eq. 14 of [1]) that updates the splat's normal, which can lead to a performance drop.
  2. We propose a ***geometrically meaningful confidence term*** (cf. L187-198 of our paper) to address the inconsistency across multiple views of the normal prior from a pretrained monocular model.
  3. We devise a ***new densification*** strategy that splits large Gaussians into smaller ones to represent the surface better. In contrast, 2DGS stops at regularizing the normals.
  4. Different from 2DGS, which has to compute the intersection depth/point first and then render a novel view based on that point (because its splatting method differs from the original Gaussian Splatting), our method only utilizes the intersection depth for training and surface reconstruction, and we can discard it for ***faster rendering*** during inference.

Refer to Tab. 1, 3, and 6 of our main paper and the table (shown below) in n7KL (R#3) Q2 of the rebuttal for the experimental comparisons with [1], where we outperform them in all settings.

| | A. Ours | B. Ours + normal consistency | C. 2DGS | D. 2DGS + d-normal |
|---|---|---|---|---|
| F1-score ↑ | 0.4 | 0.37 | 0.3 | 0.34 |

Additionally, we re-emphasize our last response: we ***do not claim*** the expected depth and the meshing approach as our contributions. It is meaningless to compare with 2DGS on the meshing approach, as it was also not first proposed by them; TSDF fusion is a common meshing method in 3D reconstruction.

Q4: Lastly, while I acknowledge the "geometric-aware confidence term" as effective and interesting, the incorporation of densification and splitting seems incremental, as this is a known strategy. Additionally, the integration of normal priors and semantic masks to improve performance does not seem new to me.
A: We respectfully disagree. Although ‘densification’ is already present in all 3DGS-based methods, the densification strategy we propose is uniquely targeted at minimizing the depth errors that arise from the remnant errors of the Gaussian normals after supervision (cf. L205 to L211 of our paper).

To this end, we first randomly sample camera views from a cuboid that encompasses the entire scene for object-centric outdoor scenes, and from the training views for indoor scenes. Since we aim to densify only the surface Gaussians, we keep only the first intersected Gaussian and discard the rest for each ray emitted from the camera. Subsequently, we densify only those with a scale above a threshold among the collected Gaussians (cf. L212 to L216 of our paper). In addition, the splitting strategy is proposed to avoid clustering. Specifically, we split the old Gaussian into two new Gaussians along the axis with the largest scale, instead of using Gaussian sampling with the position of the Gaussian as the mean and the 3D scale of the Gaussian as the variance (cf. L217 to 224); see the sketch below. Both these steps lead to significant improvements, as shown in Tab. 4 E of our paper.
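A minimal sketch of this splitting step (assumed details: rotations given as matrices, children offset by half the major scale, and that axis halved; the paper's exact parameterization may differ):

```python
import torch
import torch.nn.functional as F

def split_large_gaussians(means, scales, R, tau):
    """Replace each Gaussian whose largest scale exceeds `tau` with two
    children placed along its major axis, instead of randomly resampling.

    means: (N, 3) centers; scales: (N, 3) per-axis scales;
    R: (N, 3, 3) rotation matrices; tau: scale threshold.
    """
    s_max, axis = scales.max(dim=-1)
    mask = s_max > tau
    if not mask.any():
        return means, scales, R

    m, s, r, a = means[mask], scales[mask].clone(), R[mask], axis[mask]
    # Major-axis direction in world space: column `a` of each rotation.
    dir_world = torch.einsum('nij,nj->ni', r, F.one_hot(a, 3).float())
    half = 0.5 * s.gather(-1, a[:, None])          # half the major scale
    s.scatter_(-1, a[:, None], half)               # halve the split axis

    new_means = torch.cat([means[~mask], m + half * dir_world,
                           m - half * dir_world])
    new_scales = torch.cat([scales[~mask], s, s])
    new_R = torch.cat([R[~mask], r, r])
    return new_means, new_scales, new_R
```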

To the best of our knowledge, we are the first to propose the above-mentioned densification strategy. We ask the reviewer to kindly point us to any specific work(s) that shares the same idea as ours. We did not claim that the integration of normal priors and semantic masks is among our main contributions. Please refer to our response in Q3 for the re-emphasis of our contributions.

Comment

Q5: Therefore, the paper needs to demonstrate strong results to warrant acceptance, such as by applying the method to more challenging scenes and providing clear improvements over baselines that do not use monocular normals, particularly in highlighted or textureless scenes. However, in the current submission, I find it difficult to perceive significant improvements, such as the over-smoothed results on DTU and comparable results on MipNeRF 360 and TnT when compared to [1] (the original paper, not the reproduction).
A: We emphasize again that we ***strongly disagree with the rejection of our paper due to a concurrent work*** [1] put on arXiv within two months of the NeurIPS submission deadline (as per NeurIPS' policy).

Moreover, we respectfully point out that the reviewer has made a ***factual error*** in the claim that our experimental comparisons with [1] are not strong in various scene settings. We re-emphasize that ***we have achieved significant improvements*** over [1] in the various settings mentioned by the reviewer.

On TNT, we have made significant improvements compared with 2DGS [1] and SuGaR (0.4 vs. 0.3/0.19). Note that the 0.3 F1-score is the official result from 2DGS (not our reimplemented version). On MipNeRF 360, we can only show visualization results since we cannot obtain the ground truth of its mesh. On the Replica dataset, which has textureless indoor scenes, we have achieved a significantly higher F1-score than 2DGS without normal priors (78.17 vs. 64.36), as shown in Tab. 2 of our paper.

The above-mentioned results clearly show that our method outperforms [1] under various challenging settings.

Comment

It seems the authors may not be fully addressing my core concern, which is that their claims are incorrect. Throughout the review process, I have been emphasizing:

First, while I acknowledge that using D-normal improves quantitative results, my primary concern is with the theoretical basis presented in the paper. The central claim that supervising normals rendered from 3D Gaussians only updates the rotation parameter is not entirely accurate.

Comment

Response to Reviewer 5p3w (R#4)

Q1. The theoretical proof is not correct.
A:
We respectfully disagree with the reviewer's proof. Although $\frac{\partial \hat{\mathbf{N}}}{\partial \alpha_i}$ is a vector related to the normal, $\frac{\partial L_n}{\partial \hat{\mathbf{N}}} \cdot \frac{\partial \hat{\mathbf{N}}}{\partial \alpha_i}$ ***is a scalar***, as it is a dot product between two vectors. Consequently, the gradient of the position is still proportional to $\mathbf{x} - \mathbf{p}_i$, which means the moving direction of the position is unrelated to the normal direction. Here is the complete proof:


$$
\begin{aligned}
\frac{\partial \mathcal{L}_{\text{n}}}{\partial \mathbf{p}_i} &= \frac{\partial L_n}{\partial \hat{\mathbf{N}}} \cdot \frac{\partial \hat{\mathbf{N}}}{\partial \mathbf{p}_i}, & (1) \\
\frac{\partial \hat{\mathbf{N}}}{\partial \mathbf{p}_i} &= \frac{\partial \hat{\mathbf{N}}}{\partial \alpha_i} \cdot \frac{\partial \alpha_i}{\partial \mathbf{p}_i} + \frac{\partial \hat{\mathbf{N}}}{\partial \mathbf{n}_i} \cdot \frac{\partial \mathbf{n}_i}{\partial \mathbf{p}_i} \\
&= \frac{\partial \hat{\mathbf{N}}}{\partial \alpha_i} \cdot \frac{\partial \alpha_i}{\partial G(\mathbf{x})} \cdot \frac{\partial G(\mathbf{x})}{\partial \mathbf{p}_i} \\
&= \frac{\partial \hat{\mathbf{N}}}{\partial \alpha_i} \cdot \frac{\partial \alpha_i}{\partial G(\mathbf{x})} \cdot \left[ -G(\mathbf{x}) \, (\mathbf{R} \mathbf{S} \mathbf{S}^\top \mathbf{R}^\top)^{-1} (\mathbf{x} - \mathbf{p}_i) \right] \\
&\approx \frac{\partial \hat{\mathbf{N}}}{\partial \alpha_i} \cdot \frac{\partial \alpha_i}{\partial G(\mathbf{x})} \cdot \left[ -G(\mathbf{x}) \, (\mathbf{x} - \mathbf{p}_i) \right]. & (2)
\end{aligned}
$$

Putting (2) into (1), we get:

$$
\frac{\partial \mathcal{L}_{\text{n}}}{\partial \mathbf{p}_i} \approx \frac{\partial L_n}{\partial \hat{\mathbf{N}}} \cdot \frac{\partial \hat{\mathbf{N}}}{\partial \alpha_i} \cdot \frac{\partial \alpha_i}{\partial G(\mathbf{x})} \cdot \left[ -G(\mathbf{x}) \, (\mathbf{x} - \mathbf{p}_i) \right] = \beta \cdot \frac{\partial \alpha_i}{\partial G(\mathbf{x})} \cdot \left[ -G(\mathbf{x}) \, (\mathbf{x} - \mathbf{p}_i) \right] \propto (\mathbf{x} - \mathbf{p}_i),
$$

where $\beta = \frac{\partial L_n}{\partial \hat{\mathbf{N}}} \cdot \frac{\partial \hat{\mathbf{N}}}{\partial \alpha_i}$ is a scalar.

This completes our proof that the gradient with respect to the position is independent of the normal direction.

Comment

Thank you for the further explanation. I understand that $\frac{\partial N}{\partial \alpha}$ is a scalar, but it is still related to the normal vector, which determines the direction of movement. This suggests that when comparing the rendered normal map with the given normal map, the gradient will indeed influence the position. From my understanding, your proof aims to clarify that this movement occurs along the tangent directions, which I agree with. However, if we use multiple normal maps, would the positions still not be effectively influenced?

Another point that seems counterintuitive is that optimizing a scene with normal maps doesn't seem fundamentally different from optimizing one with color maps. Since the latter clearly updates the positions, I'm curious why this would differ when applying normal map supervision.

Comment

Response to Reviewer 5p3w (R#4)

Q1. I understand that $\frac{\partial N}{\partial \alpha}$ is a scalar, but it is still related to the normal vector, which determines the direction of movement. This suggests that when comparing the rendered normal map with the given normal map, the gradient will indeed influence the position. From my understanding, your proof aims to clarify that this movement occurs along the tangent directions, which I agree with. However, if we use multiple normal maps, would the positions still not be effectively influenced?
A: We respectfully disagree with the reviewer's comment that “$\frac{\partial N}{\partial \alpha}$ is a scalar, but it is still related to the normal vector, which determines the direction of movement.” We stated that $\frac{\partial L_n}{\partial \hat{\mathbf{N}}} \cdot \frac{\partial \hat{\mathbf{N}}}{\partial \alpha_i}$ is a scalar. A scalar value cannot determine the direction of a movement like a vector; it can only affect the amplitude of the movement. Thank you for agreeing that “this movement occurs along the tangent directions”. Consequently, the use of multiple normal maps still cannot effectively guide the position of a Gaussian toward the surface, since the direction towards the surface is absent.

Q2: Another point that seems counterintuitive is that optimizing a scene with normal maps doesn't seem fundamentally different from optimizing one with color maps. Since the latter clearly updates the positions, I'm curious why this would differ when applying normal map supervision.

A: We agree that optimizing a scene with normal maps is not fundamentally different from optimizing one with color maps, and that the latter clearly updates the positions. However, the target of optimizing with color maps is to render high-quality images instead of performing 3D surface reconstruction. The movement of the Gaussian positions when optimizing with color maps lowers the error between the rendered color and the ground-truth color. As shown in Fig. 2 (a) of the PDF file in our rebuttal, when the color (normal) of a Gaussian is closer to the GT color (surface normal), the supervision pushes this Gaussian ($\mathbf{P}_1$) towards the ray to increase its weight in the rendering equation, and vice versa ($\mathbf{P}_2$). In a nutshell, this type of position update only increases the weight of the ‘correct’ Gaussian, which moves the Gaussian closer to the ray, while decreasing the weight of the ‘wrong’ Gaussian, which moves the Gaussian away from the ray. It is not effective for surface reconstruction, where we want to push Gaussians toward the surface. The same holds for normal map supervision.

Comment

Thank you for your response. Since that paper primarily uses data (tables) from the 2DGS paper and is closely related, I didn't initially treat it as concurrent work. After confirming with the ACs and reviewing the submission guidelines, I now believe that [1] should be disregarded in this context. I appreciate the authors' efforts in providing stronger results, and I have no issues with the comparisons and evaluations presented. Therefore, I have raised my score to 4.

However, my remaining concern still needs to be addressed. I am willing to increase my score further if this claim is properly resolved.

Comment

Thank you for your response. I would like to clarify that the scalar $\partial N / \partial \alpha$ has a sign, which allows it to determine the direction of movement, either $x \rightarrow p$ or $p \rightarrow x$ (pull or push), rather than just the amplitude. Additionally, while the movement occurs in the tangent space, points can still move toward the surface due to simultaneous movement and rotation. Given that in a vanilla approach with a pure photometric loss, most Gaussian positions eventually adhere to the surface, I believe this will also be the case with normal map supervision, as the normals will fit the tangent space of the surface.

While I find it challenging to reach a full consensus (probably due to the loose definition of "effective" for "surface reconstruction"), I do believe some corrections are necessary:

The normal map supervision can indeed affect the update of positions, so some statements in the paper might need refinement.

Specifically,

L113 DN-Splatter may show severe reconstruction artifacts due to their normal supervision can only update the rotation parameters.

L174 However, the normal is only directly related to the rotation of the Gaussian during Gaussian splitting, which means directly supervising the rendered normals cannot update the positions as shown in Fig. 2.

Figure.2 (Top Right)

Although I am not entirely sure I fully grasp its theoretical analysis, I do not see any major issues as long as the suggested corrections are made. I believe that the results are strong enough, and I really like the geometric-aware confidence term. Therefore, I am raising my score (to 7) but also lowering my confidence level (to 3).

Comment

Thank you for your understanding and discussion. We will refine our paper based on our discussion and your suggestions in the final version.

Official Review (Rating: 7)

This paper presents a confidence-aided Depth-Normal regularizer that directly couples the normal with other geometric parameters, thus enabling the optimization of all geometric parameters from monocular normal priors. The paper also introduces a densification and splitting strategy to regulate the size and distribution of the 3D Gaussian distributions for more precise surface modeling. Experimental results demonstrate that the method achieves better reconstruction quality with faster training speeds and rendering compared to 3DGS and 2DGS.

Strengths

  1. The main contribution of the paper is clear. The proposed Depth-Normal regularizer provides a simple but effective way to utilize the monocular normal priors in 3DGS-based reconstruction.

  2. Experimental results strongly support the proposed strategy.

Weaknesses

  1. The conversion between depth and normals is not a new concept and has been explored in both depth estimation and normal estimation. I suggest referencing some previous works on depth-to-normal conversion, such as VNL [1]. I also acknowledge the contribution of the authors in applying this technique to 3DGS, which allows for better optimization of geometric attributes.

[1] Yin W, Liu Y, Shen C, et al. Enforcing geometric constraints of virtual normal for depth prediction[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2019: 5684-5693.

  2. 2DGS also proposes a normal consistency loss, which connects the depth and normal. It would be better to clarify the difference and conduct an ablation study comparing the two conversion strategies.

  3. Building upon point 2, I'm curious whether the gain in this paper is primarily due to the introduction of monocular normal priors; for instance, what if we maintain the normal computation strategy of 2DGS and incorporate the monocular normal prior from this paper? Since the major contribution of this paper lies in establishing the correlation between depth and normals, it would be better to provide an explanation or comparison to illustrate the superiority of the used conversion.

  4. As shown in Fig. 7 and Fig. 10, the details of the reconstruction results could still benefit from further improvement. This is also understandable since the reconstruction is based on the monocular normal prior, whose details are limited.

Questions

  1. I am curious why the training time is longer than 2DGS's but the inference time is shorter.

  2. The reconstruction results of 2DGS on TNT shown in Fig. 4 seem different from those in Fig. 10 of 2DGS's original paper; for example, the ground of the 'Truck' scene is not incomplete in 2DGS's paper, and the head of 'Caterpillar' does not contain the noisy mesh.

Limitations

The authors kindly point out their limitations in the paper. I am curious about semi-transparent objects. Does the monocular normal estimation fail on semi-transparent objects?

Author Response

Response to Reviewer n7KL (R#3)

Q1: The conversion between depth and normal is from VNL.
A: We have cited VNL in our paper as [55], where we mentioned in L39 and L177 that our depth-normal formulation is inspired by them. Nonetheless, we differ from VNL as follows: the depth-normal formulation in VNL is proposed for the depth map prediction task, whereas we adapt it to effectively update all geometric properties of the Gaussians for 3D reconstruction. Additionally, we show state-of-the-art performance on this task with this formulation.
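A minimal sketch of such a depth-to-normal conversion (assumed formulation: back-project pixels with the camera intrinsics and take the cross product of finite differences; the paper's exact D-Normal computation may differ):

```python
import torch
import torch.nn.functional as F

def depth_to_normal(depth, K):
    """Derive a pseudo normal map from a rendered depth map.

    depth: (H, W) depth values; K: (3, 3) camera intrinsics.
    Returns (H-2, W-2, 3) unit normals in camera space.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    uv1 = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()
    pts = depth[..., None] * (uv1 @ torch.inverse(K).T)  # back-projection

    dx = pts[:, 2:, :] - pts[:, :-2, :]   # horizontal central difference
    dy = pts[2:, :, :] - pts[:-2, :, :]   # vertical central difference
    n = torch.cross(dx[1:-1], dy[:, 1:-1], dim=-1)
    return F.normalize(n, dim=-1)
```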

Q2: 2DGS proposes a normal consistency. Clarify the difference and conduct the ablation comparing the two conversion strategies.
A: We differ from 2DGS as follows:

  1. We are the first to show an insightful and important analysis of how our d-normal formulation leads to an effective update of all the geometric parameters (rotation, scale, and position) of the Gaussians (cf. L184-185 of our paper), in comparison with the naive normal formulation, which effectively updates only the rotation and scale (cf. Eq. 5 of our paper). Since this important analysis is missing in 2DGS, they did not realize that it is better to supervise the d-normal with monocular normal map predictions. As a result, they propose the weaker normal consistency loss (cf. Eq. 14 of the 2DGS paper), which updates both the splat's normal and the depth normal and thus can lead to conflicting updates that hurt performance.

    To verify our d-normal is better than the normal consistency used in 2DGS, we replace our d-normal with the normal consistency in our framework as shown in Column B in the table below. This leads to a performance degradation from 0.4 to 0.37.

  2. We propose a geometrically meaningful confidence term (cf. L187-198 of our paper) based on the cosine distance between the rendered and predicted normals to down-weight inconsistent normal priors across multiple views. Consequently, our confidence term prevents excessive errors with high inconsistency from dominating the overall cost.

  3. Although the normal supervision has made the normals more accurate, there are still minor errors leading to depth errors arising from the remnant large Gaussians. We further devise a new densification strategy that splits large Gaussians into smaller ones to represent the surface better.

| | A. Ours | B. Ours + normal consistency | C. 2DGS | D. 2DGS + d-normal |
|---|---|---|---|---|
| F1-score ↑ | 0.4 | 0.37 | 0.3 | 0.34 |

Q3: Is the gain primarily due to the normal priors? Provide a comparison to illustrate the superiority of the used conversion.
A: Tab. 4 of our paper shows the contribution of each component of our framework. In addition to the monocular prior, our densification and splitting strategy also contributes significantly, improving the F1-score from 0.33 to 0.40. The confidence term and the intersection depth also play an important role, increasing the F1-score by 0.04 and 0.05, respectively. Following the reviewer's suggestion, we conduct an experiment that adds our view-consistent d-normal regularizer to 2DGS. As shown in Column D of the table above, our d-normal regularizer added to 2DGS improves the F1-score from 0.3 to 0.34. Nonetheless, it is still worse than our method.

Q4: As shown in Fig.7 and Fig.10, the details of the reconstruction need further improvement.
A: We respectfully disagree with the reviewer that Fig. 7 lacks details. Nonetheless, we agree that Fig. 10 is overly smooth, which is caused by the monocular prior, whose details are limited. Although the reconstruction in Fig. 10 is too smooth, we still achieve superior performance on the DTU dataset, as shown in Tab. 6. The over-smoothness on the DTU dataset is due to the monocular normal estimator DSINE that we use for the prior; it is mainly trained on scene-level data with only 160K images and therefore cannot produce details on the object-level DTU dataset. To obtain more detailed normal priors, we could train a larger DSINE with more parameters on a larger dataset, e.g. Omnidata with 12M images, Objaverse with over 10M objects, and MVImageNet with 6.5M frames.

Q5: Why is the training time longer than 2DGS's but the inference time shorter?
A: Our longer training time is mainly incurred by the computation of the intersection depth, the rendering of semantic maps, and the optimization of the larger number of Gaussians generated by our proposed densification strategy. However, we do not need to compute the intersection depth for novel view rendering since we utilize the original rendering method from Gaussian Splatting. The rendering of semantic maps can also be skipped during novel view rendering since it is only used to trim meshes. Consequently, our inference time (i.e. rendering) is almost as fast as Gaussian Splatting. In contrast, the rendering method in 2DGS differs significantly from Gaussian Splatting in its need to compute the intersection depth: 2DGS must compute the intersection point first and then use it to obtain the Gaussian value for rendering. The inference (rendering) speed of 2DGS is therefore slower than ours. Furthermore, the training speed of 2DGS is also slower than the original 3DGS due to its slower rendering. For example, on the TNT dataset, 2DGS needs 34.2 minutes while 3DGS needs only 14.3 minutes.

Q6: The reconstruction of 2DGS on TNT seems different from the 2DGS paper.
A: The pretrained models and meshes on the TNT dataset were not released by 2DGS before the NeurIPS submission deadline. We thus trained 2DGS on the TNT dataset with the official 2DGS code and extracted the meshes with the provided script. This may lead to slight deviations from the original paper.

Q7: Does the normal estimator fail on semi-transparent objects?
A: Yes. Details can be found in R#1Q1.

Comment

Dear Reviewer n7KL,

As we are nearing the end of the author-reviewer discussion period: please take a look at the author rebuttal as soon as possible and see if any of your concerns were addressed. Let the authors know that you read their rebuttal even if it didn't change your opinion of the paper. If your opinion changed, please update your review accordingly.

Thank you for your service as reviewer!

--Your SAC

Comment

Thanks for the authors' detailed responses and experiments. I have read the rebuttal, and all of my concerns are addressed. I strongly recommend that the authors include the derivation supporting "the supervision on rendered normal cannot effectively affect the Gaussian position", as mentioned in the answer to Q2 and the discussion with reviewer 5p3w. Based on the authors' rebuttal, I also agree with reviewer 5p3w's point that "the normal map supervision can indeed affect the update of positions". So some of the claims in the paper need justification. The d-normal achieves better optimization since it optimizes the normal considering the relation between depth and normal, rather than optimizing them independently.

Currently, the paper is solid and clearly benefits the community by its study in normal supervision of 3DGS. So I am willing to raise my score to accept.

Review
6

This paper proposes to reconstruct surface from 3D Gaussians with a view-consistent depth-normal regularization. By introducing normal prior (DSINE/GeoWizard) to regularize the distribution of 3DGS, this approach is able to render smooth and view-consistent depth, facilitating reconstruction. This paper also tries to mitigate the inconsistencies from single-view normal prediction by introducing a confidence map. However, the effect seems to be limited.

Strengths

  1. The performance is good. The proposed design can assist 3DGS for better depth rendering.
  2. Using the gradient of depth to supervise the normal makes sense; it makes the position of each GS, not just its rotation, effectively optimized, and leads to better depth results for reconstruction.
  3. The paper is easy to understand and follow.

Weaknesses

  1. Actually, using prior from monocular normal estimation has been introduced in several works, e.g. DN-Splatter; Would it help if the d-normal (calculated from simple GS depth / ray-GS intersection depth) is applied?
  2. The calculation of confidence is basically the same as the D-Normal regularization, according to equations 12 and 13. Typically, in Bayesian learning, we introduce an uncertainty term. It is predicted separately and added in log form as a regularization term to the final loss function. I'm not sure if this approach has been tried before, as the current confidence constraints seem to have limited effectiveness.
  3. Some related work should be compared and cited:
  • Neuralangelo: High-Fidelity Neural Surface Reconstruction
  • Relightable 3D Gaussian: Real-time Point Cloud Relighting with BRDF Decomposition and Ray Tracing
  • 3DGSR: Implicit Surface Reconstruction with 3D Gaussian Splatting

Questions

  1. I'm not sure why manual normalization is needed in equation (9), because according to equation (4), the denominator should already be a normalized coefficient.

Limitations

Similar to many other reconstruction methods, this approach struggles to handle the reconstruction of specular or semi-transparent objects well.

Author Response

Response to Reviewer 6weG (R#2)

Q1: Actually, using prior from monocular normal estimation has been introduced in several works, e.g. DN-Splatter; Would it help if the d-normal (calculated from simple GS depth / ray-GS intersection depth) is applied?
A: As mentioned in L31 and L113-115 of our paper, the normal supervision in DN-Splatter can only effectively update the rotation (and scale) parameters without effectively affecting the positions needed for surface reconstruction. Refer to Eq. (6) of DN-Splatter for the formulation of normal prediction using only the rotation matrix and scaling coefficients of the Gaussian. We mitigate this limitation by deriving our D-Normal formulation in Eq. (10) of our paper, which is a function of both the normal (thus rotation and scale, cf. Eq. (5) of our paper) and the position of the Gaussians. Consequently, our D-Normal improves the performance when applied to DN-Splatter. We ran experiments based on the official DN-Splatter code on the TNT dataset to verify the effectiveness of D-Normal, shown in the table below (a symbolic summary of the contrast follows the table). We will add this comparison with DN-Splatter in the final version.

| Ablation Item | Precision ↑ | Recall ↑ | F1-score ↑ |
| --- | --- | --- | --- |
| A. DN-Splatter | 0.13 | 0.14 | 0.13 |
| B. DN-Splatter + D-Normal | 0.29 | 0.34 | 0.31 |
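
In symbols (our paraphrase for illustration; the notation loosely follows Eq. (5) of our paper), the naive splat normal is

$$
\mathbf{n}_i = \mathbf{R}_i \, \mathbf{e}_{k^*}, \qquad k^* = \arg\min_k s_{i,k},
$$

where $\mathbf{R}_i$ and $s_{i,k}$ are the rotation and scales of Gaussian $i$, and $\mathbf{e}_{k^*}$ is the one-hot basis vector of the shortest axis. The position $\boldsymbol{\mu}_i$ does not appear, so a loss on this rendered normal cannot directly move a Gaussian, whereas the D-Normal is derived from the rendered depth, which does depend on $\boldsymbol{\mu}_i$.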

Q2: The calculation of confidence is basically the same as D-Norm regularization, according to equations 12 and 13. Typically, in Bayesian learning, we introduce an uncertainty term. It is predicted separately and added in log form as a regularization term to the final loss function. I'm not sure if this approach has been tried before, as the current confidence constraints seem to have limited effectiveness.
A: The Bayesian learning in [e] plays the same role as the uncertainty term we proposed in the paper, i.e. to downweigh error terms with high uncertainty in the cost function. Following the reviewer's suggestion, we incorporated an uncertainty term using Bayesian learning following [e]. Specifically, we equip each Gaussian with a learnable uncertainty term and then render the uncertainty map by alpha-blending the uncertainty terms of the Gaussians. Finally, we replace our proposed confidence map with the uncertainty map. We can see from the table below that our proposed confidence is better than the learnable uncertainty on the TNT dataset.
Bayesian learning as in [e] is useful only when we have no prior information on the uncertainty term. However, in our case, we have geometric prior information in the form of the cosine distance between the rendered and predicted normals. Consequently, our geometry-aware confidence term is more effective at preventing errors with high cross-view inconsistency from dominating the overall cost.

| Ablation Item | Precision ↑ | Recall ↑ | F1-score ↑ | Time (min) |
| --- | --- | --- | --- | --- |
| A. Uncertainty (Bayesian) | 0.35 | 0.36 | 0.37 | 65 |
| B. Ours (confidence) | 0.39 | 0.42 | 0.40 | 53 |
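
As an illustration, here is a minimal PyTorch sketch of such a geometric confidence weighting. The function name, the exponent `gamma`, and the exact mapping from cosine similarity to confidence are our illustrative assumptions, not Eqs. 12-13 of the paper:

```python
import torch
import torch.nn.functional as F

def confidence_weighted_normal_loss(rendered_n, prior_n, gamma=2.0):
    """Downweigh view-inconsistent normal priors by a geometric confidence.

    rendered_n, prior_n: (H, W, 3) normal maps (rendered vs. monocular prior).
    Pixels where the two already disagree strongly are treated as
    inconsistent across views and contribute less to the loss.
    """
    cos = (F.normalize(rendered_n, dim=-1) *
           F.normalize(prior_n, dim=-1)).sum(dim=-1)   # cosine similarity
    conf = (0.5 * (1.0 + cos)).detach() ** gamma       # confidence in [0, 1]
    return (conf * (1.0 - cos)).mean()                 # weighted normal error
```

Detaching the confidence keeps the weighting from being optimized toward zero, so it acts purely as a per-pixel downweighting of inconsistent priors.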

Q3: Some related work should be compared and cited:

[f] Neuralangelo: High-Fidelity Neural Surface Reconstruction

[g] Relightable 3D Gaussian: Real-time Point Cloud Relighting with BRDF Decomposition and Ray Tracing

[h] 3DGSR: Implicit Surface Reconstruction with 3D Gaussian Splatting
A: Thank you for your suggestion. Neuralangelo has been cited in our paper, and we will cite the two other papers. Since [g] is not a surface reconstruction method, we only compare our method with Neuralangelo [f] and 3DGSR [h] on the TNT and DTU datasets, as shown in the table below. Since 3DGSR does not report results on the TNT dataset and its code has not been released, we cannot compare our method with it on the TNT dataset. The results of Neuralangelo and 3DGSR are from the original papers. From the table, we can see that compared with Gaussian-based methods, including 2DGS and 3DGSR, our method achieves a significantly higher F1-score for large-scale reconstruction (TNT dataset) and is comparable with the concurrent works 2DGS and 3DGSR on object-level reconstruction (DTU dataset). Similar to other Gaussian-based methods such as 2DGS, our method performs worse than Neuralangelo. Nonetheless, our method is significantly more efficient than Neuralangelo, with a reconstruction speed approximately 100x faster and a rendering speed of 145 FPS (ours) vs. 10 FPS (Neuralangelo).

| Dataset / Metric | NeuS | Neuralangelo | 3DGSR | 2DGS | Ours |
| --- | --- | --- | --- | --- | --- |
| DTU (CD ↓) | 0.84 | 0.61 | 0.81 | 0.8 | 0.8 |
| TNT (F1-score ↑) | 0.38 | 0.5 | - | 0.3 | 0.4 |
| Training Time (h) | > 24 | 20 / 128 | < 1 | < 1 | < 1 |
| Rendering speed (FPS) | < 10 | < 10 | - | 68 | 145 |

(NeuS and Neuralangelo are NeuS-based; 3DGSR, 2DGS, and Ours are Gaussian-based.)

Q4: I'm not sure why manual normalization is needed in equation (9), because according to equation (4), the denominator should already be a normalized coefficient.
A: Instead of accumulating depth like color in Eq. (4), we should take the average (normalized) depth of the flattened Gaussians, since depth is a geometric quantity (as opposed to the appearance quantity of color). We find in the official code of 2DGS that mean depth is also used for large-scene reconstruction. We perform an ablation on normalized depth (i.e. mean depth) on the TNT dataset. The table below shows that using mean depth results in better reconstruction than unnormalized depth.

| Ablation Item | Precision ↑ | Recall ↑ | F1-score ↑ |
| --- | --- | --- | --- |
| A. Unnormalized | 0.35 | 0.37 | 0.35 |
| B. Normalized | 0.39 | 0.42 | 0.40 |
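
A per-ray sketch of the two variants (illustrative, not the CUDA implementation; `eps` guards empty rays and is our assumption):

```python
import torch

def composite_depth(d, alpha, normalize=True, eps=1e-8):
    """Per-ray sketch of the two ablated depth-compositing variants.

    d, alpha: (M,) depths and opacities of the Gaussians hit by one ray,
    sorted front to back.
    """
    T = torch.cat([torch.ones(1), torch.cumprod(1.0 - alpha, dim=0)[:-1]])
    w = alpha * T                         # blending weights
    depth = (w * d).sum()                 # A. unnormalized (accumulated) depth
    if normalize:                         # B. normalized (mean) depth, cf. Eq. 9
        depth = depth / (w.sum() + eps)
    return depth
```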

[e] NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections.

Comment
  • According to the derivation of the volume rendering formula, the denominator in Equation 9 ($\sum_i \alpha_i \prod_{j=1}^{i-1} (1-\alpha_j)$) should be a value very close to 1. Although the reported results demonstrate an improvement due to this normalization, it remains unclear how it works. In the implementation, does the alpha in the denominator participate in gradient backpropagation?
Comment

Response to Reviewer 6weG (R#2)

Q1: According to the derivation of the volume rendering formula, the denominator in Eq. 9 should be a value very close to 1. Although the reported results demonstrate an improvement due to this normalization, it remains unclear how it works. In the implementation, does the alpha in the denominator participate in gradient backpropagation?
A: Colors can be represented by multiple color overlays, as shown in Eq. 4 of our paper. However, depth is a geometric attribute in 3D space, and it is more suitable to represent depth as the weighted average of the Gaussians' depths rather than an unnormalized overlay. In the implementation, the alpha in the denominator does participate in gradient backpropagation. The gradient is:


$$
\begin{aligned}
\hat{D} &= \frac{\sum_{i \in M} d_i \, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)}{\sum_{i \in M} \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)} = \frac{A}{B} \\
\frac{\partial A}{\partial \alpha_i} &= \prod_{j=1}^{i-1} (1 - \alpha_j) \left[ d_i - \sum_{k=i+1}^{M} d_k \, \alpha_k \prod_{j=i+1}^{k-1} (1 - \alpha_j) \right] \\
\frac{\partial B}{\partial \alpha_i} &= \prod_{j=1}^{i-1} (1 - \alpha_j) \left[ 1 - \sum_{k=i+1}^{M} \alpha_k \prod_{j=i+1}^{k-1} (1 - \alpha_j) \right] \\
\frac{\partial \hat{D}}{\partial \alpha_i} &= \frac{\partial A / \partial \alpha_i - \hat{D} \, \partial B / \partial \alpha_i}{B}
\end{aligned}
$$

It is implemented with custom CUDA kernels. The CUDA code below is our implementation.

```cuda
// Back-to-front accumulation of the two suffix sums in the derivation above:
// accum_depth_rec  = sum_{k>i} d_k * alpha_k * prod_{j=i+1}^{k-1} (1 - alpha_j)
// accum_depth_rec2 = sum_{k>i} alpha_k * prod_{j=i+1}^{k-1} (1 - alpha_j)
accum_depth_rec = last_alpha * last_depth + (1.f - last_alpha) * accum_depth_rec;
accum_depth_rec2 = last_alpha + (1.f - last_alpha) * accum_depth_rec2;
// Bracketed terms of dD/dalpha_i divided by B (= weight_final), chained with
// dL/dD; the shared prefix transmittance is applied later in the backward
// kernel, as in the original 3DGS implementation.
dL_dopa += (((c_d - accum_depth_rec) - depth_final * (1 - accum_depth_rec2)) / weight_final) * dL_dpixel_depth;
```
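
As a sanity check, the analytic gradient above can be compared against autograd on synthetic values (a hypothetical PyTorch snippet, not part of our codebase):

```python
import torch

torch.manual_seed(0)
M = 8
alpha = (0.05 + 0.9 * torch.rand(M)).requires_grad_()  # synthetic opacities
d = torch.linspace(1.0, 3.0, M)                        # synthetic depths

# Front-to-back transmittance T_i = prod_{j<i} (1 - alpha_j)
T = torch.cat([torch.ones(1), torch.cumprod(1 - alpha, dim=0)[:-1]])
w = alpha * T
A, B = (d * w).sum(), w.sum()
D_hat = A / B
D_hat.backward()
grad_autograd = alpha.grad.clone()

# Analytic gradient from the derivation above
with torch.no_grad():
    grad_analytic = torch.zeros(M)
    for i in range(M):
        tail_d = sum((d[k] * alpha[k] * torch.prod(1 - alpha[i + 1:k])
                      for k in range(i + 1, M)), torch.tensor(0.0))
        tail_w = sum((alpha[k] * torch.prod(1 - alpha[i + 1:k])
                      for k in range(i + 1, M)), torch.tensor(0.0))
        dA = T[i] * (d[i] - tail_d)
        dB = T[i] * (1 - tail_w)
        grad_analytic[i] = (dA - D_hat * dB) / B

print(torch.allclose(grad_autograd, grad_analytic, atol=1e-5))  # expect: True
```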
Comment

The derivation of the gradient and the CUDA implementation are right.

However, I don't think "overlays" is a reasonable term to distinguish color and depth. Actually, equation 4 is just a way to calculate the expectation / mean of a variable, whether it's color or depth.

I strongly suspect that this difference is due to the influence of the final weight term $weight_{final} = \prod_{i=1}^M (1-\alpha_i)$, which is the residual between 1 and $\sum_i \alpha_i \prod_{j=1}^{i-1} (1-\alpha_j)$. For color integration/averaging, in the real implementation we add $weight_{final}$ multiplied by the background color (0 or 1) to equation 4. However, for depth, there is no corresponding depth value for the calculation (nothing to stand in for the background color). Therefore, we must discard $weight_{final}$ and renormalize the remaining weights.

Comment

Q1: I strongly suspect that this difference is due to the influence of the final weight term $weight_{final} = \prod_{i=1}^M (1-\alpha_i)$, which is the residual between 1 and $\sum_i \alpha_i \prod_{j=1}^{i-1} (1-\alpha_j)$. For color integration/averaging, we add $weight_{final}$ multiplied by the background color (0 or 1) to equation 4 in the real implementation. However, for depth, there is no corresponding depth value (nothing to replace the background color). Therefore, we must discard $weight_{final}$ and renormalize the remaining weights.
A: We agree that the reason is the influence of the final weight term $T = \prod_{i=1}^M (1-\alpha_i)$. For color rendering, the final rendered color is $\hat{\mathbf{C}} + T \cdot bg\_color$. However, for depth we need to normalize the remaining weights since there is no background depth.
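
For completeness, the identity behind this point (a one-line telescoping derivation):

$$
\sum_{i=1}^{M} \alpha_i \prod_{j=1}^{i-1} (1-\alpha_j) \;=\; 1 - \prod_{i=1}^{M} (1-\alpha_i) \;=\; 1 - T,
$$

so color compositing is completed by the background term ($\hat{\mathbf{C}} + T \cdot bg\_color$), while depth, which has no background value, divides out the deficit instead: $\hat{D} = \sum_{i} d_i \, \alpha_i \prod_{j=1}^{i-1} (1-\alpha_j) \, / \, (1 - T)$.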

Review
7

The paper introduces a novel view-consistent Depth-Normal (D-Normal) regularizer and an uncertainty-aware normal regularizer, followed by a new densification and splitting strategy, to address the limitations of existing Gaussian Splatting methods in surface reconstruction, such as rendered-normal supervision updating only the rotation parameter and the inconsistency of predicted normal maps. This approach enables full optimization of the Gaussian geometric parameters and reduces inaccuracy, thereby improving the quality of 3D surface reconstruction.

Strengths

The proposed method introduces novel insights, such as the observation that supervising normals rendered from 3D Gaussians only updates the rotation parameter. This is quite interesting. The overall model is supported by extensive experiments across multiple datasets, demonstrating significant improvements in both reconstruction quality and rendering speed compared to previous methods. The simplicity and effectiveness of this method are particularly commendable. Additionally, the paper is well-written and presents its concepts in a clear and accessible manner.

Weaknesses

There are not many significant weaknesses in the method. However, a more in-depth analysis of scenarios where the method may fail, particularly with highly inconsistent normal predictions, would provide a clearer understanding of the method's boundaries.

Questions

Please see the weakness section.

Limitations

Yes.

Author Response

Response to Reviewer kSsq (R#1)

Q1: A more in-depth analysis of scenarios where the method may fail, particularly with highly inconsistent normal predictions, would provide a clearer understanding of the method's boundaries.
A: Our method is likely to fail on semi-transparent objects (shown in Fig. 11 of the paper) since the monocular normal estimator (cf. Line 168 of the paper) we use to supervise the normal map (cf. Eq. 8 and Eq. 11) cannot accurately estimate the normal prior of semi-transparent objects. The failure of the monocular normal estimator on semi-transparent objects stems from the inability of depth cameras to accurately capture the ground-truth depth maps of semi-transparent objects (from which ground-truth normal maps are converted) for training.

Comment

Thanks to the authors for their detailed responses and extensive experiments during the rebuttal period. I have reviewed the rebuttal, including the discussions with other reviewers, and my concerns have been addressed. I believe this is solid work that will further contribute to the community, and I will maintain my original rating. However, I also note that some claims in the paper require further justification, such as the statement that only rotation parameters are updated, as discussed with reviewer 5p3w. I encourage the authors to incorporate the theoretical analysis provided during the rebuttal period into the main paper or supplementary materials to enhance the readers' understanding.

Comment

Thank you for your review. We will incorporate the theoretical analysis and refine some claims in the final version.

Comment

Dear Reviewer kSsq,

As we are nearing the end of the author-reviewer discussion period: please take a look at the author rebuttal as soon as possible and see if any of your concerns were addressed. Let the authors know that you read their rebuttal even if it didn't change your opinion of the paper. If your opinion changed, please update your review accordingly.

Thank you for your service as reviewer!

--Your SAC

Author Response

To all reviewers:

We thank the reviewers for their constructive feedback. As summarized by our reviewers, this paper introduces “novel insights” (R#1) with “clear contributions” (R#3), and the proposed method is “simple and effective” (R#1, 3). Through “comprehensive evaluation” (R#4), our method achieves “good performance” (R#2) with “significant improvements” (R#1).

Here, we explain the differences between our method and the concurrent works 2DGS and Gaussian Surfels [d] to answer R#3Q2 and R#4Q7.

A: We cited 2DGS and even showed experimental comparisons with 2DGS in our paper. However, Gaussian Surfels first appeared on arXiv at the beginning of May 2024, which was too close to the NeurIPS submission deadline; we did not notice the paper until after our NeurIPS submission. It should also be noted that both 2DGS and Gaussian Surfels are papers from SIGGRAPH 2024, which took place less than 2 weeks ago, well after the NeurIPS submission deadline in May 2024. Although we consider Gaussian Surfels concurrent work, we will cite and compare with it in our final paper.

We differ from 2DGS and Gaussian Surfels as follows:

  1. We show an insightful and important analysis of how our D-Normal formulation can lead to a complete and effective update of all the geometric parameters (rotation, scale, and position) of the Gaussian (cf. L184-185 of our paper), in comparison with the naive formulation of the normal, which effectively updates only the rotation and scale (cf. Eq. 5 of our paper). We also mention in our paper that our derivation of D-Normal is inspired by VLN (cf. reference [55] in our paper), which was originally derived for depth map prediction.

    Since this important analysis is missing in 2DGS and Gaussian Surfels, they did not realize that it is better to supervise the D-Normal with monocular normal map predictions alone. As a result, both 2DGS and Gaussian Surfels propose the weaker normal consistency loss (cf. Eq. 14 of the 2DGS paper), which updates both the splat's normal and the depth normal and can thus lead to conflicting updates that hurt performance. Although Gaussian Surfels also supervise the depth normal with the monocular normal, their addition of the weaker normal consistency loss, which also updates the splat's normal, can cause the performance to drop (cf. column A vs. B of the table in the response to R#3).

  2. To address the inconsistency of the normal prior from a pretrained monocular model across multiple views, we propose a geometrically meaningful confidence term (cf. L187-198 of our paper). It is based on the cosine distance between the rendered and predicted normals and downweighs inconsistent normal priors across multiple views. Consequently, our confidence term prevents errors with high cross-view inconsistency from dominating the overall cost. This important confidence term is present in neither 2DGS nor Gaussian Surfels.

  3. Although the normal supervision has improved the normal accuracy, we notice that there is still a minor error leading to depth errors arising from the remaining large Gaussians. We thus devise a new densification strategy that splits large Gaussians into smaller ones to represent the surface better. In contrast, both 2DGS and Gaussian Surfels stop at regularizing the normals.

  4. Different from 2DGS, which has to compute the intersection depth/point first and then render a novel view based on that point because its splatting method differs from the original Gaussian Splatting, our method only utilizes the intersection depth for training and surface reconstruction, and we can discard it for faster rendering during inference.

  5. Compared with Gaussian Surfels, the formulations of the intersection depth are different: ours reduces the problem to the intersection between a ray and a plane, which is exact, while Gaussian Surfels uses a local Taylor expansion to approximate the intersection (a minimal sketch of the ray-plane computation follows this list).
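
A minimal sketch of the exact ray-plane intersection (the names and the parallel-ray guard are illustrative assumptions):

```python
import torch

def ray_plane_depth(o, r, mu, n, eps=1e-8):
    """Exact depth of a camera ray's intersection with a splat's plane (sketch).

    o: (3,) camera center; r: (3,) unit ray direction;
    mu: (3,) Gaussian center; n: (3,) unit normal of the flattened Gaussian.
    Solves n . (o + t * r - mu) = 0 for t in closed form.
    """
    denom = torch.dot(n, r)
    # Guard against rays (nearly) parallel to the plane.
    denom = torch.where(denom.abs() < eps, torch.full_like(denom, eps), denom)
    return torch.dot(n, mu - o) / denom
```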

Finally, we compare our method with Gaussian Surfels on the TNT and DTU datasets, as shown in the table below. The result of Gaussian Surfels on DTU is taken directly from their paper. Since Gaussian Surfels did not conduct experiments on the TNT dataset, we ran their official code on the TNT dataset with the same normal priors as ours. From the table, we can see that our method outperforms Gaussian Surfels [d].

| Dataset | Gaussian Surfels | Ours |
| --- | --- | --- |
| DTU (CD ↓) | 1.19 | 0.8 |
| TNT (F-score ↑) | 0.17 | 0.4 |

Here are some references used in the rebuttal:

[a] Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets From 3D Scans. Ainaz Eftekhar, Alexander Sax, Jitendra Malik, Amir Zamir. ICCV 2021.

[b] Objaverse: A Universe of Annotated 3D Objects. Matt Deitke et al. CVPR 2023.

[c] MVImgNet: A Large-scale Dataset of Multi-view Images. Xianggang Yu et al. CVPR 2023.

[d] High-quality Surface Reconstruction using Gaussian Surfels. Pinxuan Dai et al. SIGGRAPH 2024.

[e] NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. CVPR 2021.

Final Decision

After the rebuttal and significant discussion between the reviewers and the authors, this paper received consistent positive scores.

Reviewers find that it is well-motivated and well-executed, with interesting methodological insights into the relationship between Gaussian positions, their depths, and their normals, as well as an extensive evaluation.

After significant discussion with SACs and other ACs, the AC concludes that this paper is of significant interest to the part of the community working on 3D gaussian splatting and novel view synthesis, but not necessarily relevant to the broader NeurIPS community. The AC thus proposes acceptance as a poster.