Multi-times Monte Carlo Rendering for Inter-reflection Reconstruction
Abstract
Reviews and Discussion
This paper proposes a comprehensive inverse rendering method that incorporates indirect illumination to enhance the decomposition quality of environment lighting and BRDF materials. Specifically, this work uses multi-time Monte Carlo integration to model light transport and devises algorithms to accelerate computation by precomputing the diffuse term and leveraging an SDF-based geometry for initialization. Both qualitative and quantitative results are promising, demonstrating realistic shadows on glossy surfaces and higher PSNR values.
Strengths
- The results exhibit a decent quality. By incorporating indirect illumination, the modeling of shadows and glossy surfaces achieves a remarkable degree of photorealism.
- The underlying theory is sound to me. The GGX appearance model is the de facto standard in industry, and the approximation applied for the diffuse term is standard and commonly used in real-time rendering.
- As mentioned, I believe that some insights are derived from real-time rendering techniques, and I personally appreciate this, as accurately modeling indirect illumination is also important for effective inverse rendering.
Weaknesses
- The optimization time is relatively long compared to other methods.
- Due to computational limitations, the number of bounces is restricted, thus limiting the ability to model shiny surfaces and confining the results to glossy surfaces.
Questions
In general I enjoyed reading this paper. The content is well written and clear to me. Therefore, I’m on the positive side of this submission. A few questions regarding the technical details:
- Will the geometry obtained from Sec. 3.3 be optimized in inverse rendering? The writeup suggests this (Lines 189-190), but I didn't find any experiments addressing it. I am curious about potential geometry improvements.
- Regarding the material editing mentioned in Section 4.5, I have some questions about its implementation. It seems that material properties are represented as MLP. How do you intuitively edit different material properties, such as roughness? Or are these properties explicitly exported and stored as UV maps?
- What does “orm” stand for in k_orm?
Limitations
Limitations and potential negative social impact are well discussed.
-
This is indeed a current limitation of our approach. Using Monte Carlo integration to evaluate the reflection equation is inherently more computationally intensive than the real-time split-sum approximation used by nvdiffrec, because we have to perform ray tracing and numerical integration. Worse, as the tracing depth increases, the number of rays we need to emit grows exponentially. Fortunately, as we show in Figure 4 of the paper, a tracing depth of 2 is enough to handle most scenes; a tracing depth of 3 greatly increases the amount of computation without significantly improving the final results.
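To make the ray-count argument concrete, below is a toy sketch (Python/NumPy; the uniform hemisphere sampling, the constant environment light, and the Lambertian BRDF are placeholders rather than our actual renderer) of how the number of rays emitted per shading point grows with the tracing depth:

```python
import numpy as np

def sample_hemisphere(n, rng):
    # Uniformly sample n directions on the unit hemisphere around +z.
    u, v = rng.random(n), rng.random(n)
    z = u
    r = np.sqrt(np.maximum(0.0, 1.0 - z * z))
    phi = 2.0 * np.pi * v
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=-1)

def radiance(depth, max_depth, n_samples, rng):
    # Toy recursive Monte Carlo estimator: every bounce spawns n_samples
    # secondary rays, so the total ray count grows exponentially with depth.
    if depth >= max_depth:
        return 0.0, 0
    dirs = sample_hemisphere(n_samples, rng)
    total, rays = 0.0, n_samples
    for d in dirs:
        li, sub_rays = radiance(depth + 1, max_depth, n_samples, rng)
        # Placeholder Lambertian BRDF (albedo / pi) times the cosine term d[2];
        # incoming light = constant environment (1.0) plus indirect light li.
        total += (0.5 / np.pi) * (1.0 + li) * d[2]
        rays += sub_rays
    # Uniform hemisphere pdf is 1 / (2 * pi), hence the 2 * pi factor.
    return 2.0 * np.pi * total / n_samples, rays

rng = np.random.default_rng(0)
for max_depth in (1, 2, 3):
    _, rays = radiance(0, max_depth, n_samples=16, rng=rng)
    print(f"tracing depth {max_depth}: {rays} rays per shading point")
```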
-
We do optimize the geometry in the subsequent inverse rendering, but unfortunately only the PSNR can be improved. Compared with the non-optimized geometry, the optimized geometry generally improves PSNR by 1 to 2 dB on the novel-view validation set, while the geometry itself does not visibly improve. We speculate that this is because the geometry is first learned with an implicit method that involves no rasterization step, whereas our differentiable rendering requires rasterization; directly applying the implicit geometry therefore yields a lower PSNR due to this difference in the forward process. By optimizing the geometry during the subsequent inverse rendering, we overcome this gap and obtain a higher PSNR.
-
k_orm is a texture map whose R, G, and B channels represent alpha, roughness, and metallic, respectively.
Inspired by nvdiffrec's excellent work, our material learning is divided into two stages. In the first stage, we use an MLP to represent the material; in the second stage, we use xatlas to generate texture coordinates automatically and then sample the MLP to initialize 2D textures. Subsequently, the textures are optimized based on the gradients provided by nvdiffrast. We experimentally compared the two material representations: using an MLP to represent materials converges quickly, and the subsequent optimization with 2D textures continues to improve the rendering and relighting results.
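As an illustration of this two-stage pipeline, here is a minimal sketch (PyTorch; the MLP architecture, the texture resolution, and the uv_to_xyz mapping are hypothetical placeholders, since the real mapping comes from the xatlas UV parameterization of the mesh) of how stage two can initialize the 2D textures by sampling the stage-one MLP:

```python
import torch

# Hypothetical stand-in for the stage-one material MLP: maps a 3D surface
# point to (albedo_rgb, roughness, metallic).
material_mlp = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 5), torch.nn.Sigmoid(),
)

def bake_mlp_to_texture(mlp, uv_to_xyz, resolution=512):
    # Initialize a 2D texture by querying the MLP at the surface point that
    # each texel maps to; uv_to_xyz would come from the xatlas UVs and the mesh.
    u, v = torch.meshgrid(
        torch.linspace(0, 1, resolution),
        torch.linspace(0, 1, resolution),
        indexing="ij",
    )
    uv = torch.stack([u, v], dim=-1).reshape(-1, 2)
    with torch.no_grad():
        texels = mlp(uv_to_xyz(uv))
    # The returned texture is then refined with gradients from the
    # differentiable rasterizer (nvdiffrast) during stage two.
    return torch.nn.Parameter(texels.reshape(resolution, resolution, -1))

# Dummy uv -> xyz mapping just so the sketch runs end to end.
texture = bake_mlp_to_texture(
    material_mlp, lambda uv: torch.cat([uv, torch.zeros_like(uv[:, :1])], dim=-1))
print(texture.shape)  # (512, 512, 5): albedo (3) + roughness + metallic
```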
Dear authors,
Thanks for clarifying my questions! I have no further questions and overall I really enjoyed reading this paper! I will keep my score as a Weak accept.
Thank you for your reply and appreciation. We will include the discussion in the final paper.
The authors propose an inverse rendering method that reconstructs the geometry, materials, and lighting of 3D objects from 2D images, effectively handling scenes with multiple reflections and inter-reflections. To address the high computation cost of Monte Carlo sampling, the authors propose a specularity-adaptive sampling strategy, significantly reducing the computational complexity. The authors also introduce a reflection-aware surface model to initialize the geometry and refine it during inverse rendering, to improve the inverse rendering reconstruction quality. Experiments show the proposed method outperforms other inverse rendering methods.
Strengths
- The paper is well-written and easy to understand.
- The proposed methods are intuitive and the results look nice.
Weaknesses
-
The validation on real data is missing. The paper only shows its results on synthetic data. It would be better to test this work on challenging real data, such as the specular real data captured by NeRO.
-
The selected baselines are incomplete. There exists a strong baseline candidate, NeRO, which can also reconstruct the geometry and materials of specular objects (although its main focus is geometry reconstruction). Since it has released all of its code, it would be better if the authors could also compare this work with NeRO.
-
Need more analysis on the SG encoding. SG encoding looks like a very important part of the final quality (because it plays an important role in the geometry reconstruction). However, the authors only use a small part of Fig. 2 to showcase its effectiveness. More visual and quantitative comparisons would be appreciated. Meanwhile, SG encoding seems to be very similar to [1], which was released on arXiv last year. The authors should explain the difference between their SG encoding and [1]'s encoding. Since the authors didn't cite [1], I assume that SG encoding should be the authors' original contribution.
-
The quantitative comparison is insufficient. More quantitative comparisons on the relighting results and reconstruction of BRDF and geometry should be included.
[1] SpecNeRF: Gaussian Directional Encoding for Specular Reflections
Questions
- Please show some results on real data with some baseline comparison.
- Please provide more baseline comparisons and quantitative comparisons as discussed in the weakness section.
- Please answer the question related to SG encoding.
Limitations
The limitations have been discussed in the paper.
Missing validation on real data:
Thank you very much for your instructive feedback, which pointed out the oversight in our experiments. The NeRO real-world dataset is indeed a challenging dataset because the color of the object's surface can change significantly with different viewing angles, causing ambiguities in both material and geometry. We have supplemented the results on the two NeRO real-world datasets, bear and coral. On the bear dataset, we detailed our rendering results, PBR materials, and reconstructed geometry, as shown in Figure 1. Figure 2 shows our relighting results on the bear and coral datasets. From the relighting results, it can be seen that the disentangled PBR materials are reasonable, and our method effectively distinguishes between the object's material and the reflections on its surface. Figure 3 shows our geometric reconstruction results on the coral dataset. Our method achieves results very close to the ground truth, similar to NeRO, and shows significant improvements over inverse rendering methods like nvdiffrec and nvdiffrecmc.
Comparisons with NeRO:
NeRO is an outstanding recent work on object reconstruction, focusing on the challenging scenario of high-reflection objects and achieving excellent results. The difficulty of high-reflection objects lies mainly in two aspects. First, due to the nature of specular reflection, the observed color of an object changes significantly with the viewing angle, leading to view inconsistency. Second, for smooth objects, especially those with low roughness, indirect lighting has a greater impact and can even play a major role from certain angles. However, the calculation of indirect lighting is closely related to the object's geometry, the material properties of other regions of the object, and the ambient lighting, making it a relatively difficult quantity to model and compute.
In the rebuttal pdf, we compared our work with NeRO in terms of rendering quality, PBR materials, geometric reconstruction, and relighting. In Figure 1, we provide a detailed comparison on the NeRF synthetic dataset materials and the NeRO real-world dataset bear. Our method achieved excellent results on both materials and bear. While NeRO performed well on bear, it encountered issues on materials. It can be seen that indirect lighting caused significant ambiguities in their geometry. Both our method and NeRO's method consider indirect lighting, but their indirect lighting is implicitly learned through neural networks, whereas ours is explicitly calculated using ray tracing and Monte Carlo sampling, which may explain the differences in our results. Figures 2 and 3 show the geometric reconstruction results on coral and our relighting results on bear and coral, respectively. These results demonstrate that our method is not inferior to NeRO in terms of geometric and material reconstruction of high-reflection objects, and in scenarios where light continuously reflects between multiple objects, we can achieve even better results.
Inspired by BakedSDF [1], we replaced the MLP with a set of SGs (spherical Gaussians) to directly obtain diffuse light. [2] also demonstrated the effectiveness of this approach. Given our focus on reflective objects, we employed different structures to more effectively distinguish between diffuse light and specular light. The SGE used for specular light is similar to SG but operates in a higher dimensionality. Since specular light is significantly influenced by the viewing angle and geometry, it is more complex than diffuse light, requiring more information for its representation.
[1] Yariv et al., BakedSDF: Meshing Neural SDFs for Real-Time View Synthesis, SIGGRAPH 2023
[2] Reiser et al., Binary Opacity Grids: Capturing Fine Geometric Detail for Mesh-based View Synthesis, TOG 2024
Thanks for your reply and valuable comments.
-
To better capture the geometry of reflective objects, we employed the structure shown in Fig. 2(b), which differs from the representation used by SpecNeRF. The Gaussian Directional Encoding in SpecNeRF utilizes 3D Gaussians, each parameterized by a position $\mathbf{p}$, a scale $\mathbf{s}$, and a quaternion rotation $\mathbf{q}$.
In contrast, our SG encoding uses spherical Gaussians of the form $G(\mathbf{v}; \boldsymbol{\xi}, \lambda, \mu) = \mu\, e^{\lambda (\mathbf{v} \cdot \boldsymbol{\xi} - 1)}$, where $\boldsymbol{\xi}$ is the lobe axis, $\lambda$ is the lobe sharpness, and $\mu$ is the lobe amplitude.
-
We used two environment lights, Golden Bay and Limpopo Golf Course from Poly Haven [1], together with the Materials scene as ground truth to generate the relighting dataset. We report the PSNR results as follows:

| PSNR (↑) | Golden Bay | Limpopo Golf Course |
|---|---|---|
| Ours | 19.38 | 19.72 |
| nvdiffrecmc | 16.12 | 16.47 |
| nvdiffrec | 15.90 | 16.04 |
We apologize for overlooking your comments on quantitative results. For your convenience, we present the quantitative metric results (from the discussion with reviewer UfeL) as follows:
Evaluation of the albedo and the normal:

| PSNR (↑) | albedo | normal | |
|---|---|---|---|
| Ours | 21.50 | 18.56 | 28.58 |
| nvdiffrecmc | 20.04 | 18.67 | 26.09 |
| nvdiffrec | 17.58 | 18.64 | 18.31 |
Evaluation of the reconstructed geometry:

| Chamfer Distance (↓) | coral | bear | materials |
|---|---|---|---|
| NeRO | 0.13 | 0.11 | 0.0057 |
| Ours | 0.13 | 0.10 | 0.0030 |
[1] Poly Haven. https://polyhaven.com/
Thanks for the authors' replies, which solved some of my questions, but some questions still remain:
- As I said in W3, it seems that SG encoding is very important to the result quality. Fig. 5 of the rebuttal file indicates the same thing. I said in my review that I wanted to know the difference between the SG encoding used in [1] and the authors' method. I am not sure if the SG encoding can be considered an original contribution of this work, but the authors did not fully address this in their response.
[1] SpecNeRF: Gaussian Directional Encoding for Specular Reflections
- I mentioned in the weakness section that the quantitative comparison of this paper is insufficient. I am not sure why the authors didn't respond to this concern in their rebuttal reply to me. I noticed that Reviewer UfeL had similar concerns, which the authors addressed in their rebuttal. However, it seems that a quantitative comparison for relighting is still missing.
Given these unresolved issues, I don't plan to raise my score.
This paper proposes an inverse rendering method that handles multi-time Monte Carlo integration, which models indirect illumination. It reduces the computational cost by pre-computing the diffuse map based on a Lambertian model. It also proposes to use spherical Gaussian encoding to improve the initial SDF reconstruction of scenes.
Strengths
-
The presentation is clear and easy to follow.
-
Modeling indirect illumination is so far under-explored in inverse rendering.
Weaknesses
-
Overclaim of novelty: multi-bounce MC integration has already been employed in earlier inverse rendering work [1]. On the other hand, the diffuse Fresnel term (Eq. 7) is not commonly used in recent inverse rendering works. Most related works (TensoIR, nvdiffrec/nvdiffrecmc) use exactly Eq. 9, so the claim of transferring from Eq. 7 to Eq. 9 should not be counted as a contribution.
-
Continuing from (1), using a pre-computed diffuse map also prevents correct modeling of diffuse shadows; this is neither discussed nor mentioned in the paper.
-
Inaccurate/incomplete description of the geometry module: IPE is used for encoding input positions to MLPs, while judging by the text (line 181) spherical Gaussians should be used for encoding viewing/reflecting directions. However, in this work, spherical Gaussians are actually used to replace IPE, which is very confusing. Also, in Figure 2(b) the entire Diffuse MLP is replaced with "SG" and there is no explanation of what this stands for. In the same figure, the term "SGE" is not explained (maybe spherical Gaussian encoding? Then what is the difference between SGE and SG?).
-
Weak evaluation: there is only a single table in the entire paper. Only PSNR and training time are reported. It is not clear how the PSNR metric is computed (is it novel-view synthesis? Is it relighting?). There is no evaluation of other intrinsic properties that are commonly evaluated in other inverse rendering papers, e.g. albedo, normal error, etc. The proposed spherical Gaussian encoding is not properly ablated - only a qualitative comparison is presented.
-
Missing datasets: it would make the evaluation more complete if the method could evaluate more standard datasets (NeRFactor dataset, TensoIR dataset) where many other works have been evaluated. Also, there is no real-world dataset tested, making it questionable whether the proposed algorithm can work in real-world scenarios.
Factual errors: GGX has nothing to do with the diffuse term used in the Disney BRDF. GGX is a microfacet normal distribution function that models specifically the specular part of the Disney BRDF. The diffuse Fresnel term in the Disney BRDF was introduced in the original Disney BRDF paper and is often omitted in real-time rendering engines and replaced with the Lambertian model.
[1] Mai et al., Neural Microfacet Fields for Inverse Rendering, ICCV 2023
Questions
Please see the weaknesses.
Limitations
A crucial limitation, i.e. that diffuse shadows are not handled by the proposed method, is omitted and not properly discussed.
Overclaim of novelty
The paper Neural Microfacet Fields for Inverse Rendering is indeed an excellent work that considers indirect lighting, but their method for computing indirect lighting is significantly different from ours. As stated in their paper, they approximate the rendering equation into the form of their Eq. (15), but a key variable in this equation is obtained from the environment light map. This is fundamentally different from our approach: our method performs Monte Carlo sampling again, essentially repeating the same procedure to compute the indirect lighting anew. Their method essentially belongs to the implicit learning of indirect lighting, which we believe does not amount to computing indirect lighting with multi-bounce MC integration.
Furthermore, in our paper, we list Equations 7 to 9 to explain the principle of approximating the diffuse part of our indirect lighting using the diffuse map, even though real-time rendering typically adopts the simpler Lambertian form of Eq. 9 directly.
Diffuse map
Our precomputed diffuse map is used only to approximate the diffuse part of the indirect lighting. Although the diffuse map does influence the modeling of shadows, it affects the modeling of light within the shadows rather than the light directly entering the viewer's eye. In our Table 1, we also demonstrate the impact of using the diffuse map on the results. The results show that using our acceleration structure has almost no impact on the accuracy of the final results, while significantly saving computation time.
Weak evaluation and missing datasets: Thank you for pointing out the shortcomings of our quantitative comparisons. The PSNR metric is computed on novel-view synthesis. Indeed, our judgment of the quality of the material decomposition had so far been a subjective one based on everyday experience, so we took the table-horse scene from our dataset, generated ground-truth albedo and roughness for it, and used it to quantitatively measure the decomposition quality of our work. The influence of geometry on the final results also requires quantitative experiments to prove the effectiveness of the method we use; we have therefore supplemented the experiments on the geometry part, and the quantitative comparison of the geometry's effect on the final results is shown in Figure 5 of the rebuttal PDF.
In addition, thank you for pointing out the inaccuracies in our paper. According to Disney's paper and the book Real-Time Rendering, GGX is indeed a microfacet normal distribution function, and it does not refer to the BRDF proposed by Disney. Thanks again for finding and correcting our error.
We computed the PSNR metric on our table horse dataset. The results are shown in the table below:
| PSNR (↑) | albedo | normal | |
|---|---|---|---|
| Ours | 21.50 | 18.56 | 28.58 |
| nvdiffrecmc | 20.04 | 18.67 | 26.09 |
| nvdiffrec | 17.58 | 18.64 | 18.31 |
Our SG (spherical Gaussian) lobes were not used for encoding viewing/reflecting directions. Similar to the method in [1], the feature output by the SDF network is combined with a set of SG lobes to compute the color values. [2] also demonstrated the effectiveness of this approach. However, since we are dealing with more complex reflective objects, we adopted two different structures to better distinguish between diffuse light and specular light. The SGE used for specular light is similar to SG but operates in a higher dimension. We used higher dimensions to represent specular reflection because specular reflections are view-dependent.
[1] Yariv et al., BakedSDF: Meshing Neural SDFs for Real-Time View Synthesis, SIGGRAPH 2023
[2] Reiser et al., Binary Opacity Grids: Capturing Fine Geometric Detail for Mesh-based View Synthesis, TOG 2024
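To make the structure above concrete, here is a minimal sketch of evaluating a bank of SG lobes (PyTorch; the number of lobes, the output channels, and the way the SDF feature would modulate the lobe parameters are illustrative assumptions, not the exact architecture of Fig. 2(b)):

```python
import torch
import torch.nn.functional as F

def sg_eval(dirs, lobe_axes, sharpness, amplitudes):
    # Spherical Gaussian lobes: G(v) = mu * exp(lambda * (v . xi - 1)).
    #   dirs:       (N, 3) unit directions (e.g. surface normals)
    #   lobe_axes:  (K, 3) unit lobe axes xi
    #   sharpness:  (K,)   lobe sharpness lambda
    #   amplitudes: (K, C) lobe amplitudes mu, C channels per lobe
    cos = dirs @ lobe_axes.t()                    # (N, K)
    weights = torch.exp(sharpness * (cos - 1.0))  # (N, K)
    return weights @ amplitudes                   # (N, C)

# Illustrative setup: 16 lobes producing 3 channels (e.g. diffuse RGB).
K = 16
axes = F.normalize(torch.randn(K, 3), dim=-1)
lam = torch.rand(K) * 30.0
mu = torch.rand(K, 3)
normals = F.normalize(torch.randn(1024, 3), dim=-1)
diffuse = sg_eval(normals, axes, lam, mu)
print(diffuse.shape)  # (1024, 3)
```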
Thanks for the reply. I have additional questions regarding the author's arguments:
-
The authors argue that in Neural Microfacet Fields (NMF for short), Eq. 15 uses the environment light map, which is different from your design. I believe that here you mean the quantity termed the irradiance environment map. The term might be misleading, but it essentially records pre-integrated cosine-weighted incoming light over the hemisphere for each possible normal direction. This is (almost) equivalent to the pre-convolved diffuse light map in nvdiffrec (as well as in this work, if I understand correctly). The only difference would be that they further compress the irradiance map into an SH representation, while in nvdiffrec the pre-convolved diffuse light map is stored in a 6x512x512 cubemap. On the other hand, NMF uses MC sampling to compute the rendering equation, and they trace 2 bounces for indirect illumination. Please check Sec. 4.5 and Tab. 1 of the paper. Fundamentally it uses the same MC integration with multi-bounce (2) sampling to model indirect illumination.
-
Regarding the diffuse map, the paper states that "we can precompute it via an MLP" (line 165), but how? What is the difference between your diffuse map and that of nvdiffrec, except for MLP vs. explicit tensor?
-
Ignoring diffuse shadow would not impact the final results, most likely because the tested scenes are specular. This is why I asked the author to test more datasets, especially the TensoIR dataset where objects are not that specular (but also not fully diffuse)
-
Also, NeRO's glossy-real dataset contains ground-truth geometry. It would make the evaluation more complete if the authors can provide a quantitative comparison on this glossy-real dataset against NeRO.
-
I do not find anything related to diffuse map ablation in Tab. 1, can you elaborate on which entry in the table specifically reflects the effectiveness of diffuse map? Also, what does the "w.o Acc." entry do?
Thank you very much for your comments.
1.
Based on Section 4.5 of NMF, our rendering methods are indeed quite similar when our depth is set to 2. However, there are still significant differences in our work:
-
NMF uses a density field to represent geometry, whereas we use a triangle surface mesh. Since our representation method is consistent with the widely adopted approach in the industry, it allows us to easily leverage hardware-accelerated ray tracing.
-
NMF mentions in Section 6, "It also does not handle interreflections very well, since the number of secondary bounces is limited." However, continuing from (1), the number of secondary bounces in our approach is not limited.
-
Additionally, our method allows for easily increasing the ray tracing depth. A higher tracing depth is more effective in scenes where light repeatedly bounces between two shiny objects.
Thank you very much for pointing out that we overlooked this excellent work. We will conduct more experiments in the future to compare the effectiveness of our approaches. Additionally, we will validate our method in challenging scenes where light repeatedly bounces between two shiny objects.
We use the term secondary to denote a ray bounced from a surface position, similar to NMF. For the secondary color, we use Equation 10 to calculate the diffuse color. Furthermore, the diffuse light at a point depends only on that point's position, which is why we can represent it with an MLP. However, for light directly entering the eye, we still use Monte Carlo sampling to compute the diffuse light rather than relying on this MLP.
Our directly viewed diffuse light includes shadows, and we use it to supervise the diffuse light of our secondary rays. This essentially means that we have baked the shadows into the diffuse light, which is a key difference from the diffuse light used in nvdiffrec. Although our focus is on reflective objects, we will also conduct additional experiments on non-reflective datasets, similar to TensoIR, in the future. We greatly appreciate your feedback regarding the datasets.
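A minimal sketch of this self-supervision (PyTorch; the MLP size, the optimizer settings, and the dummy batch are illustrative assumptions, and in practice the traced diffuse colors come from the primary-ray Monte Carlo renderer):

```python
import torch

# Hypothetical diffuse-light MLP: surface position x -> diffuse light L_d(x).
diffuse_mlp = torch.nn.Sequential(
    torch.nn.Linear(3, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 3), torch.nn.Softplus(),
)
optimizer = torch.optim.Adam(diffuse_mlp.parameters(), lr=1e-3)

def training_step(surface_points, traced_diffuse):
    # Self-supervise the MLP with the diffuse color obtained by Monte Carlo
    # ray tracing along the primary rays (so shadows are baked in). Secondary
    # rays can then query the MLP instead of tracing further bounces.
    pred = diffuse_mlp(surface_points)
    loss = torch.nn.functional.mse_loss(pred, traced_diffuse.detach())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch so the sketch runs: 4096 shading points and their traced diffuse.
pts = torch.rand(4096, 3) * 2.0 - 1.0
traced = torch.rand(4096, 3)
print(training_step(pts, traced))
```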
We evaluated the Chamfer distance, and the results are shown in the table below:
| Chamfer Distance (↓) | coral | bear | materials |
|---|---|---|---|
| NeRO | 0.13 | 0.11 | 0.0057 |
| Ours | 0.13 | 0.10 | 0.0030 |
It can be seen that on the bear and coral datasets, our geometry performs just as well as NeRO's, but on the materials dataset, our results are better.
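For reference, one common way to compute the symmetric Chamfer distance between point clouds sampled from the reconstructed and ground-truth meshes (Python/SciPy sketch; the normalization convention here, averaging nearest-neighbor distances in both directions, may differ from the exact one used for the table above):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pts_a, pts_b):
    # Symmetric Chamfer distance between two (N, 3) / (M, 3) point sets.
    d_ab, _ = cKDTree(pts_b).query(pts_a)  # nearest GT point for each predicted point
    d_ba, _ = cKDTree(pts_a).query(pts_b)  # nearest predicted point for each GT point
    return d_ab.mean() + d_ba.mean()

# Toy usage with random point clouds; in practice the points are sampled
# uniformly from the two mesh surfaces.
rng = np.random.default_rng(0)
print(chamfer_distance(rng.random((2000, 3)), rng.random((2000, 3))))
```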
Our "acc" means the diffuse map, and "w/o acc" indicates that for the diffuse color of the secondary rays, we still use ray tracing directly, similar to how it's done for the primary rays.
Thanks for the prompt reply.
- I am convinced by the argument on the differences between NMF and the proposed method. Please include such discussion in the final version of the paper.
2-3. Regarding diffuse light, now I see there was a misunderstanding on my side. I thought the diffuse light (Eq. 7-10) refers to the outgoing radiance towards the camera/eye, but it actually describes the incoming radiance towards the intersection between the primary (camera) ray and the object surface. Eq. 7-10 does not consider the shadowing effect, but Eq. 6 considers the shadowing effect for the diffuse part of the primary ray via ray tracing. So for diffuse color, you only trace one bounce, while for specular color, you trace more than one bounce (up to 3 in the experiments). Please correct me if I am wrong in any of the above statements.
- The results look good, I am also convinced of this part
I am now inclined to accept the paper, though I still have some additional questions regarding the diffuse light:
-
How is the environment map represented? Is it the same as nvdiffrec/nvdiffrecmc, i.e. environment map is stored as a mipmap/tensor?
-
How do you "precompute" Eq. 10 exactly? You mentioned that the whole term (or just the light integration part?) is represented as an MLP. However, the environment light is not known prior. You need to optimize the environment map during the training process as I understand. Then how do you precompute Eq. 10 and store it in an MLP, as the environment map is changing during the training process?
Thank you very much for your positive reply. We sincerely appreciate you pointing out that we overlooked an excellent work, NMF. In the final version of our paper, we will include the comparison between our work and NMF.
Your statement is correct. Indeed, for diffuse color, we only trace one bounce, while for specular color, we trace more than one bounce. The light that reaches the eyes is accounted for by ray tracing, including shadows.
Similar to what is described in Equation 10, the diffuse light at a point can be represented as a function that depends solely on the position on the object; empirically, we can represent it in the same form as our other spatially varying quantities. The environment map itself is the same as in nvdiffrec/nvdiffrecmc and is stored as a mipmap.
- As for the precomputation in Eq. 10, we initially had similar concerns, which led us to conduct the following three sets of experiments:
- At the beginning of the optimization, we set depth=1 to quickly learn a rough diffuse light map.
- Set depth=2, and at the beginning of the experiment do not use the diffuse light map structure. Once the optimization reaches a certain level, then apply the diffuse light map.
- Directly optimize using an unknown diffuse light.
Our results indicate that directly using the diffuse map achieves the same effectiveness as the previous two approaches, and the unknown diffuse light map can be optimized quickly; a schematic of these three schedules is sketched below. We will include this in our final paper.
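A schematic of the three schedules (plain Python; the iteration counts and the switch point are illustrative placeholders, not the settings used in our experiments):

```python
def schedule(variant, it, warmup_iters=5000):
    # Returns (tracing_depth, use_diffuse_map) for training iteration `it`.
    if variant == "warmup_depth1":       # experiment 1: depth 1 first
        return (1 if it < warmup_iters else 2), True
    if variant == "late_diffuse_map":    # experiment 2: enable the map later
        return 2, it >= warmup_iters
    return 2, True                       # experiment 3: use the map directly

for variant in ("warmup_depth1", "late_diffuse_map", "direct"):
    print(variant, [schedule(variant, it) for it in (0, 10000)])
```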
Once again, thank you for your feedback and valuable comments.
I am mainly confused about the statement "we can pre-compute it via an MLP" (line 165). Judging by lines 199-200, it seems that this diffuse light MLP is supervised solely by the diffuse color prediction from the model. So it seems that it has nothing to do with "pre-computation". If this is the case, then the MLP is just self-supervised and the term "pre-compute" is causing unnecessary confusion.
On the other hand, precomputation of the light integral in Eq. 10 can be done by convolving the stored environment map tensor with the clamped cosine lobe - this operation is quite efficient during training. I was thinking along this way as this is also used in nvdiffrec.
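For instance, a rough sketch of that pre-convolution (Python/NumPy; the equirectangular parameterization, nearest-neighbor lookup, and Monte Carlo estimate are simplifications, not nvdiffrec's actual filtered-mipmap implementation):

```python
import numpy as np

def irradiance_map(env, n_dirs=2048, out_h=16, out_w=32, seed=0):
    # Pre-convolve an equirectangular radiance map with the clamped cosine
    # lobe: E(n) = integral over the sphere of L(w) * max(0, n . w) dw.
    rng = np.random.default_rng(seed)
    z = rng.uniform(-1, 1, n_dirs)
    phi = rng.uniform(0, 2 * np.pi, n_dirs)
    r = np.sqrt(1 - z * z)
    dirs = np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=-1)  # (N, 3)
    h, w, _ = env.shape
    theta = np.arccos(np.clip(dirs[:, 2], -1, 1))
    px = ((phi / (2 * np.pi)) * w).astype(int) % w
    py = np.clip((theta / np.pi * h).astype(int), 0, h - 1)
    radiance = env[py, px]                                           # (N, 3)

    out = np.zeros((out_h, out_w, 3))
    for i in range(out_h):
        for j in range(out_w):
            nt = (i + 0.5) / out_h * np.pi
            nphi = (j + 0.5) / out_w * 2 * np.pi
            n = np.array([np.sin(nt) * np.cos(nphi),
                          np.sin(nt) * np.sin(nphi), np.cos(nt)])
            cos = np.maximum(0.0, dirs @ n)
            # Monte Carlo estimate; uniform sphere pdf is 1 / (4 * pi).
            out[i, j] = 4 * np.pi * (radiance * cos[:, None]).mean(axis=0)
    return out

# For a constant unit environment, each texel should be close to pi.
print(irradiance_map(np.ones((64, 128, 3))).mean())
```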
Overall, I would recommend the authors to make the following changes in the final version:
-
The diffuse light MLP is used to predict only light beyond one bounce. This is hinted at in Fig. 1; however, the transition around lines 153-154 is a bit hard to follow. It should be stated more clearly by changing the notation in the equations to denote that the diffuse light in this section is for indirect illumination only. Lines 154-162 are also unnecessary, as the diffuse term is not really a part of the GGX model, and most existing works (nvdiffrecmc, for example) already use the Lambertian model for the diffuse appearance.
-
Describe accurately that the diffuse light MLP is self-supervised by the diffuse appearance prediction of primary rays, and remove the "pre-compute" term that might cause confusion for readers.
Thank you for the instant reply. The "precompute" term here indeed may cause confusion and we will consider removing it in the final version. We will modify the corresponding part in the final version based on the discussion to make the sentences clearer. Thanks again for the instructive comments.
This paper presents a method for learning disentangled scene representations from images. The proposed pipeline has two stages: the first recovers geometry using an SDF-based model to learn geometry from the images through differentiable rendering. The second applies differentiable ray tracing to predict material parameters and environment lighting. The proposed method differs from previous approaches in that it renders inter-reflections between objects which create indirect lighting rather than only reflections of the environment and shadows. This is shown to produce better results than existing methods in scenes where these inter-reflections are present.
Strengths
The core idea is solid, and seems to have the expected effect on improving both recovery of environment lighting and rendering quality. I think it would also be fairly applicable to other pipelines which use ray tracing to recover lighting, most likely with similar benefits.
The explanation is clear, and I think the method should be reproducible from the paper, and the authors have also said they will release code.
Generally, I think this is a good contribution tackling a quite difficult problem. The strategies of diffuse pre-computation and spherical Gaussian radiance models will likely be useful to future efforts in this direction.
Weaknesses
It would have been nice to see some results on real data. The synthetic examples seem to show a clear improvement, but it would help to see the difference in a more practical setting.
Questions
Would it be possible to run on the real capture dataset from NeRD, similar to nvdiffrec?
Limitations
The authors are clear about limitations, and I think they cover them well.
More experiments on real-world datasets
Thank you for your valuable suggestions. Experiments on real-world datasets are important for assessing the practicality of inverse rendering tasks. Therefore, we have supplemented the experiments on two real-world datasets, i.e., NeRD and NeRO. On the NeRD dataset, we compared our method with nvdiffrec and nvdiffrecmc. Additionally, the NeRO dataset includes several high-reflection objects, making it a challenging dataset. Conducting experiments on this dataset allows for a convenient comparison with the strong baseline NeRO. Thus, we also performed experiments on this dataset, with results shown in Figures 1, 2, and 3 in the rebuttal pdf.
These results demonstrate that our method can generate detailed geometry and accurate BRDF materials from multi-view images, even when dealing with challenging real-world objects. This capability results in impressive relighting outcomes. However, our results on the NeRD dataset did not show significant improvements over nvdiffrec and nvdiffrecmc, which could be due to two main reasons: first, the object's base color is very strong, making it difficult for indirect lighting to have a significant impact. Second, the object has a predominantly convex shape, making it challenging to receive indirect light reflected from other areas. Therefore, our method's strengths are not fully demonstrated on this particular object.
We sincerely thank all reviewers for their constructive comments. Your suggestions have been invaluable in refining and strengthening our work. In this general response, we will address the three important parts that were commonly mentioned in the discussions, namely the experiment on the real-world dataset, the comparison experiment with a strong baseline candidate, and the experiment on the geometry module.
Experiments on real-world dataset
We acknowledge the importance of evaluating our proposed method on the real-world dataset to demonstrate its applicability and generalization. We have taken your advice into consideration and conducted extensive experiments on a diverse and challenging real-world dataset. For ease of comparison with the current strong baseline NeRO, we conducted experiments on the high-reflection real-world datasets bear and coral captured by them. The results are shown in Figures 1, 2, and 3. Additionally, we conducted experiments on the NeRD real-world dataset, with results shown in Figure 4. Through these experiments, we aim to demonstrate the effectiveness and generalizability of our approach in handling challenging real-world datasets.
Comparison experiment with a strong baseline candidate
We appreciate your suggestion to compare our work with the latest outstanding work. NeRO is an excellent contribution to the field of object reconstruction, focusing on high-reflection objects, much like our work. To demonstrate the effectiveness of our method, we conducted a comprehensive comparison with NeRO from three aspects. Figure 1 presents a thorough comparison on the NeRF synthetic dataset materials and NeRO’s real-world dataset bear. Figure 2 shows our relighting results on real-world datasets. Figure 3 displays our geometric reconstruction results on the NeRO real-world dataset coral. These results clearly demonstrate that our method performs on par with NeRO on their real-world datasets, and we achieve even better results in scenarios like the materials dataset, where light continuously reflects between objects.
Experiment on the geometry module
Thank you very much for pointing out the shortcomings in our geometric experiments. In our method, high-quality geometry is indeed a critical factor affecting the final results, as our approach to calculating indirect lighting relies on accurate geometry. Ambiguities in geometry can also lead to ambiguities in the object's material and surface color. Therefore, in Figure 5, we present the direct impact of learning geometry with our improved method compared to the previous method on subsequent results, as well as the quantitative comparison in terms of PSNR.
We believe that these revisions and additional experiments will strengthen our paper and contribute to the advancement of the field. We are grateful for your valuable comments and appreciate the opportunity to enhance the quality and impact of our work.
This paper received four positive leaning reviews --- one 5 (borderline accept), one 6 (weak accept), and two 7s (accept).
There was broad appreciation for the technical soundness of the work, clean presentation, high quality of the results, and the overall importance of the problem addressed herein (modeling interreflections in inverse rendering).
There were some concerns raised around limited real experiments, and other technical issues which the authors addressed in a comprehensive rebuttal. There was substantial post-rebuttal discussion, following which, an accept consensus was reached. Hence, the decision to accept the paper.
The authors are strongly urged to incorporate the reviewer feedback and include the additional comparisons / validation promised in the rebuttal when preparing the final camera-ready version.