InsertNeRF: Instilling Generalizability into NeRF with HyperNet Modules
We present InsertNeRF, a novel paradigm that instills generalizability into NeRF and its derivative works.
Abstract
Reviews and Discussion
This paper presents a method called InsertNeRF, which aims to instill generalizability into Neural Radiance Fields (NeRF) without extensive modifications to the vanilla NeRF framework. The method utilizes multiple plug-and-play HyperNet modules to dynamically tailor NeRF's weights to specific reference scenes, allowing for more accurate and efficient representations of complex appearances and geometries. The main contributions of InsertNeRF are: (1) introducing a novel paradigm that inserts HyperNet modules into NeRF-like systems to achieve generalizability, (2) designing two types of HyperNet module structures tailored to different NeRF attributes, and (3) demonstrating state-of-the-art performance and potential in various NeRF-like systems. The paper also discusses related works on generalizable NeRF and hyper-networks, and provides an overview of the method's background and implementation details.
Strengths
- The idea of using HyperNet to solve the important problem of generalizable NeRF is a very good and promising pipeline. Thus, the novelty of the proposed method is very strong.
- The proposed InsertNeRF achieves generalizability without extensive modifications to the vanilla NeRF framework.
- HyperNet modules can dynamically tailor NeRF's weights to specific reference scenes, allowing for more accurate and efficient representations.
- The proposed InsertNeRF achieves superior generalization performance and can be integrated with other NeRF-like systems. It also demonstrates state-of-the-art performance and potential in various NeRF-like systems, even in sparse input settings.
- The writing and the presentation of the paper are good.
Weaknesses
- All the contributions, points, and details of the methods and experiments have been clearly presented.
Questions
- How about the computational cost and efficiency of the proposed method compared with other methods?
- It is suggested that the authors release their source code so that readers can better reproduce the proposed method.
Ethics Review Details
Generalizable NeRF is a very important issue, and the authors propose an excellent solution using the idea of HyperNet. Overall, this is a high-quality submission and I would like to rate it as strong accept.
General reply: Thank you for your thorough summary of our paper and positive comments. We are greatly encouraged by your assessment of our paper as novel and well-written, and by your acknowledgment that it successfully addresses a crucially important issue of NeRF. In complete agreement with your statement, we also believe that generalizable NeRF is a pivotal issue for future research, as it will be a crucial step in the practical application of NeRF. If you have any additional concerns, please let us know and we will get back to you as soon as possible!
Q1. How about the computational cost and efficiency of the proposed method compared with other methods?
A1. Thank you for bringing this up. Computational cost and efficiency are crucial aspects of the experimental process. Therefore, we addressed this issue in Figure 4 of the initial draft, where we employed bar charts to depict evaluation times and compared them with the state-of-the-art method GNT (ICLR 2023). To further discuss this point, we additionally tested GeoNeRF (ECCV 2022) under Setting I: it evaluates in 24 s with 3 inputs, while InsertNeRF takes 11.3 s. It is evident that our approach demonstrates faster inference speeds across varying numbers of reference images, which can be attributed to our framework not using the popular Transformer architecture. For the same reason, the computational cost of our approach is significantly reduced.
Q2. It is suggested to release the source code so that readers can better reproduce the proposed method.
A2. We agree wholeheartedly with your comment. We promise to open-source the code after publication. Thank you for your suggestion.
I have read the authors response. The authors have addressed all my questions. I keep my score.
Dear Reviewer km3C
Thank you for your time and effort in reading our response! If you have any additional concerns, please let us know and we will get back to you as soon as possible!
Thank you!
Paper 693 Authors
In the paper, the authors use multiple plug-and-play hypernetwork modules in NeRF-based models to obtain generalization properties. The proposed method can be integrated with many different NeRF-based models.
Strengths
- The paper obtains good experimental results.
- The framework can be used with many different NeRF architectures.
Weaknesses
- The paper is hard to read. It is unclear what the inputs to the newly introduced components are.
- It is hard to understand the general idea of the model, and Fig. 2 is completely unclear.
- General formulas (3) and (4) are unclear.
- The experimental section does not include experiments on the ShapeNet-based dataset (see pixelNeRF).
Questions
- The model uses a hypernetwork, so it should be compared with hypernetwork-based methods like Points2NeRF or HyperNeRFGAN.
- In Figure 1, the authors show that the introduced method works nicely with many NeRF frameworks but do not compare with other NeRFs with generalization properties. This is misleading.
- Section 2.1 does not include hypernetwork-based NeRFs. The authors do not mention TriPlaneNeRF or MultiPlaneNeRF. Furthermore, generative models using NeRF should be mentioned.
- It is quite difficult to understand formula (3). Does F_view take as input one point (x, d) or all points on the ray? Does F_sample take as input one point (x, d) or all points on the ray?
- It is quite difficult to understand formula (4). It looks like F_sample was changed to HyperNet. What is the input to the hypernetwork and what is the input to F_sample? Such a formula should be described more carefully.
- Figure 2 is completely unclear.
- Authors write, "While this method can be adapted to a variety of NeRF-based systems (Sec. 4.4), we focus on its application on the vanilla NeRF in this section." and then directly built the method on formula (3) dedicated to Generalizable NeRF.
- How difficult is it to find the parameters lambda_1 and lambda_2 in the cost function?
- In the experimental section, the authors should add experiments on ShapeNet data, similar to pixelNeRF.
Q7. Figure 2 is completely unclear.
We apologize for any inconvenience caused. We have redesigned Figure 2 in the new version. Due to space constraints, we omitted some descriptions of the entire pipeline and Fig. 2 in the main text, which may have caused confusion about Fig. 2. In the initial draft's Appendix C.2, we described the entire network's training and testing process using pseudocode, as Algorithms 1 and 2. Next, allow us to provide a detailed overview of the entire network process based on Fig. 2:
We depict the entire process of InsertNeRF in the Fig.2(a). In contrast to vanilla NeRF, we introduce the plug-and-play HyperNet modules. Next, we will provide a detailed description of how to construct the HyperNet modules.
1): Firstly, training scenes are sampled from the training dataset, and reference as well as target images, along with their corresponding camera parameters, are extracted from the sampled scenes. This part is described in the second line of Algorithm 1 and the leftmost part of Fig. 2 (Novel Target View and Reference Images).
**Sample:** $\left\{ \left\{ \mathbf{I}_{T}, \mathbf{P}_T\right\}, \left\{ \mathbf{I}_{n}, \mathbf{P}_n\right\}_{n=1}^N \right\} \leftarrow \mathcal{D}_\text{TrScene}\leftarrow \mathcal{D}_\text{Train}.$
2): Based on $\left\{ \mathbf{I}_{T}, \mathbf{P}_T\right\}$, sample the target rays and obtain sample points as in vanilla NeRF. Utilize a U-Net to extract features from the reference images as described in formula (5). Differing from existing works, an additional component is employed to provide auxiliary supervision in this process, as described in Sec. 3.3. This part is depicted in Fig. 2 (b).
3): According to the sampling positions, camera parameters $\left\{ \left\{\mathbf{P}_n\right\}_{n=1}^N, \mathbf{P}_T\right\}$, and reference features, the features of each target sampled point can be extracted from the reference features to get $\left\{\mathbf{f}_n\right\}_{n=1}^N$, as in formula (4)'s $\mathbf{F}_n\left( \Pi_n(\mathbf{r}(t_i))\right)$. This part is described in the second half of Fig. 2 (c).
4): Then $\mathcal{F}_{\text{view}}$ is employed to aggregate features from the reference features. In $\mathcal{F}_{\text{view}}$, the sampled features $\left\{\mathbf{f}_n\right\}_{n=1}^N$ are used as inputs to generate the scene features for each sampling position, as described in Sec. 3.2.2. This part is described in the top half of Fig. 2 (c).
5): These scene features are then fed to the Sampling-aware Filter to generate the Weights & Bias (green and yellow blocks) and DFILM (pink block) for the HyperNet modules, as described in Sec. 3.2.3. This part is described in Fig. 2 (d).
6): Then multiple Weights, Bias, and DFILM outputs are used to construct the two types of HyperNet modules, as in Fig. 6. Finally, these HyperNet modules are inserted into the original NeRF-based framework, where the original NeRF-based framework is represented by blue dashed lines. This part is described in Sec. 3.2.3, the fourth line of Algorithm 1, and the top half of Fig. 2 (a). (A minimal code sketch of this mechanism follows after this list.)
7): Finally, based on the sampling points along the target ray, the predicted colors and weights of the sampled points are obtained through the NeRF framework augmented with the attached HyperNet modules, as in formula (4). The rendered colors are obtained through volume rendering, and the process is supervised with the photometric loss. Apart from the added HyperNet modules, this process remains the same as vanilla NeRF. This part is described in Sec. 3.3 and the second half of Fig. 2 (a).
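For intuition, here is a minimal, self-contained PyTorch sketch (our own illustration with assumed names such as `HyperLinear`, not the paper's actual code) of the core idea in steps 5)-7): a linear layer whose weight and bias are generated from an aggregated scene feature and then applied to the per-point NeRF activations.

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    """Linear layer whose parameters are predicted from a scene feature vector."""
    def __init__(self, in_dim: int, out_dim: int, scene_dim: int):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # Small heads that map the scene feature to a flattened weight matrix and a bias.
        self.weight_gen = nn.Linear(scene_dim, in_dim * out_dim)
        self.bias_gen = nn.Linear(scene_dim, out_dim)

    def forward(self, x: torch.Tensor, scene_feat: torch.Tensor) -> torch.Tensor:
        # x: [num_points, in_dim] per-point activations; scene_feat: [scene_dim]
        w = self.weight_gen(scene_feat).view(self.out_dim, self.in_dim)
        b = self.bias_gen(scene_feat)
        return x @ w.t() + b

# Toy usage: 1024 sampled points with 64-dim activations and a 128-dim scene feature.
layer = HyperLinear(in_dim=64, out_dim=64, scene_dim=128)
out = layer(torch.randn(1024, 64), torch.randn(128))  # [1024, 64], weights tailored to this scene
```

In the method described above, the generated weights would come from the Sampling-aware Filter outputs of step 5), but the mechanism of applying scene-conditioned weights is the same.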
To further reflect your comment, we have redesigned Figure 2 and updated the description for Figure 2. All changes are highlighted in BLUE font.
Q8. Authors write, "While this method can be adapted to a variety of NeRF-based systems (Sec. 4.4), we focus on its application on the vanilla NeRF in this section." and then directly built the method on formula (3) dedicated to Generalizable NeRF.
We apologize for any inconvenience caused. In fact, our InsertNeRF is built on vanilla NeRF as in formula (1). With the aggregated scene features and HyperNet modules, InsertNeRF's entire process can be succinctly summarized as formula (4), where $\mathbf{\Omega}_T$ represents the adaptive parameters from the HyperNet modules. Note that formulas (1) and (4) can be replaced by other NeRF-based frameworks, which we also endow with generalizability, as demonstrated in Sec. 4.4.
We describe formula (3) merely to illustrate the pipeline of existing generalizable NeRFs, which helps emphasize the distinctiveness of our method. Formulas (3) and (4) share the same process regarding $\mathcal{F}_{\text{view}}$, but the specific methods differ, as described in Sec. 3.2.2. For the generalizable NeRF task, we believe that extracting scene features from reference images is necessary, aligning with the approach of most existing works, since it prompts the network to understand the scene. However, how to leverage the scene features to build a generalizable NeRF after $\mathcal{F}_{\text{view}}$ is a crucial aspect that needs exploration, and it is also the main distinction between InsertNeRF and existing works, as in formula (4).
Q9. How difficult is it to find parameters $\lambda_1$ and $\lambda_2$ in the cost function?
Thank you for your reminder and suggestions. We agree that the values of $\lambda_1$ and $\lambda_2$ should be explained in the experimental details of the article. In our experiments, $\lambda_1$ is set to 0.1 and $\lambda_2$ is set to 1.
We have added the description of $\lambda_1$ and $\lambda_2$ to Appendix A for experimental details. All changes are highlighted in BLUE font.
Q10. In the experimental section, the authors should add experiments on ShapeNet data, similar to pixelNeRF.
Thank you for your reminder and suggestions. The majority of existing generalizable NeRF works, including MVSNeRF [6], IBRNet [1], GeoNeRF [5], WaveNeRF [7], NeuRay [2], and GNT [4], conducted their experiments on three datasets: NeRF Synthetic, LLFF, and DTU. Although these works mention pixelNeRF or are contemporaneous with it, none of them has been evaluated on the ShapeNet dataset. We speculate that this might be because the NeRF Synthetic, LLFF, and DTU datasets have more complex textures and geometric features, allowing GNeRF tasks to operate in a broader range of scenarios.
If you are interested, we can further discuss the performance of InsertNeRF on the ShapeNet dataset.
[1] Wang Q et al. "IBRNet: Learning multi-view image-based rendering" CVPR 2021
[2] Liu Y et al. "Neural rays for occlusion-aware image-based rendering" CVPR 2022
[4] Wang P et al. "Is attention all that NeRF needs?" ICLR 2023
[5] Johari M M et al. "GeoNeRF: Generalizing NeRF with geometry priors" CVPR 2022
[6] Chen A et al. "MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo" ICCV 2021
[7] Xu M et al. "WaveNeRF: Wavelet-based generalizable neural radiance fields" ICCV 2023
We hope our responses have addressed the concerns of the reviewer and we are happy to answer any further questions.
The authors have addressed all my questions, added experiments, and clarified some formulas. I raised my score.
Dear Reviewer pJTL
Thank you for your time and effort in reading our response! If you have any additional concerns, please let us know and we will get back to you as soon as possible!
Thank you!
Paper 693 Authors
General reply: Thank you for your constructive review and helpful suggestions! We give a detailed response to your questions and comments in the following. If any of our responses do not adequately address your concerns, please let us know and we will get back to you as soon as possible.
Q1. It is unclear what the inputs to the newly introduced components are.
We apologize for any inconvenience caused. This is explained in detail in Q7 and the improved Fig. 2.
Q2. Comparison with the hypernetwork-based methods Points2NeRF and HyperNeRFGAN
Thank you for pointing this out. Points2NeRF focuses on training NeRF from point clouds, while HyperNeRFGAN attempts to generate a 3D-aware GAN. Points2NeRF and HyperNeRFGAN are both very forward-looking works that directly incorporate a hypernetwork into NeRF. However, in the generalizable NeRF task, such an idea is suboptimal, since it overlooks the characteristics of different attributes, such as volume density and color. This issue is mentioned in the third paragraph of the Introduction. Furthermore, they struggle to capture the relationship between the inputs (reference images) in the target's sampling process. Therefore, we propose two types of HyperNet module structures (for the different attributes) and Sampling-aware Filters, respectively, to mitigate the aforementioned two issues.
Since Points2NeRF and HyperNeRFGAN differ from the generalizable NeRF task we primarily focus on, for a fair comparison we directly introduce a hypernetwork into NeRF (without HyperNet modules and Sampling-aware Filters), as in Points2NeRF and HyperNeRFGAN. In fact, this experiment, which validates our motivation, was conducted among our early experiments, as mentioned in the third paragraph of the Introduction.
| | NeRF Synthetic | | | LLFF | | |
|---|---|---|---|---|---|---|
| Model | PSNR | SSIM | LPIPS | PSNR | SSIM | LPIPS |
| NeRF with hypernetwork (as in Points2NeRF and HyperNeRFGAN) | 25.86 | 0.902 | 0.081 | 24.25 | 0.793 | 0.177 |
| InsertNeRF | 27.57 | 0.936 | 0.056 | 25.68 | 0.861 | 0.126 |
To further reflect your comment, we have added this experiment to Table 2 and a description in the Introduction. As the experimental results indicate, the proposed HyperNet modules and Sampling-aware Filters significantly improve rendering performance in the GNeRF task over directly introducing a hypernetwork into NeRF as in Points2NeRF and HyperNeRFGAN. All changes are highlighted in BLUE font.
Q3. Figure 1 does not compare with other NeRFs with generalization properties.
We apologize for any inconvenience caused. Firstly, we would like to confirm what 'other NeRFs with generalization properties' specifically refers to; we presume it refers to works focused on generalizable NeRF.
For the generalizable NeRF, we visually compared existing works with our InsertNeRF for color-depth renderings in Figure 1b. In Figure 1b, the results on the left are from IBRNet [1], while the results on the right are from NeuRay [2]. Due to space limitations, more results are presented in the Appendix.
For derivative works of NeRF, to our knowledge, much of the current research on generalizability primarily focuses on vanilla NeRF. Our work pioneers the exploration of generalizability for NeRF-based frameworks such as mip-NeRF and NeRF++, achieving generalization in NeRF and its derivative works. This is also one of the key motivations behind this work. Note that existing works have not specifically focused on derivative works of NeRF, and directly transferring them to other NeRF-based frameworks is challenging. Therefore, in Figure 1a, we only present visual comparisons between derivative works of NeRF, such as mip-NeRF and NeRF++, and our InsertNeRF-like systems. Additional results are presented in Figures 12 and 13.
[1] Wang Q et al. "IBRNet: Learning multi-view image-based rendering" CVPR 2021
[2] Liu Y et al. "Neural rays for occlusion-aware image-based rendering" CVPR 2022
Q4. Section 2.1 does not include hypernetwork-based NeRFs. The authors do not mention TriPlaneNeRF or MultiPlaneNeRF. Furthermore, generative models using NeRF should be mentioned.
We apologize for any inconvenience caused. In fact, we described the application of hypernetworks in NeRF in Section 2.2. Due to space limitations, we supplemented some information about image-based rendering and neural scene representation in the initial draft's Appendix B.
TriPlaneNeRF and MultiPlaneNeRF are representative recent works that explore novel and more effective representations for NeRF. In fact, in the Related Work section, we mentioned that [3] follows a similar idea.
Here, we additionally discuss more generative models, particularly those related to hypernetworks.
With the development of generative models, 3D generative models have been widely discussed, enabling the direct construction of 3D representations such as point clouds [1], surfaces [2], voxels [3] and NeRF [6]. A significant number of works have leveraged techniques from image generative models and applied them to 3D generation, including GANs [4] and diffusion models [5]. In this work, we focus on 3D generative models with hypernetworks. [8] is an early work that builds variable-size representations of point clouds with a hypernetwork. Then, HyperFlow [2] uses a hypernetwork to model 3D objects as families of surfaces, and Points2NeRF [7] utilizes a hypernetwork to generate NeRF from a 3D point cloud. Additionally, in recent years, some NeRF works have focused on this technique and directly incorporate a hypernetwork into NeRF, as described in Section 2.2. However, in the generalizable NeRF task, such an idea is suboptimal, since it overlooks the characteristics of different attributes, such as volume density and color. Furthermore, these works struggle to capture the relationship between the inputs (reference images) in the target's sampling process. Therefore, InsertNeRF proposes two types of HyperNet module structures (for the different attributes) and Sampling-aware Filters, respectively, to mitigate the aforementioned two issues.
To further reflect your comment, we have added the abovementioned related works about generative models and TriPlaneNeRF in Appendix B, and HyperNeRFGAN in Sec. 2.2. All changes are highlighted in BLUE font.
[1] Zamorski M et al. "Adversarial autoencoders for compact representations of 3D point clouds"
[2] Spurek P et al. "Hyperflow: Representing 3d objects as surfaces"
[3] Zhou L et al. "3d shape generation and completion through point-voxel diffusion"
[4] Kania A et al. "Hypernerfgan: Hypernetwork approach to 3d nerf gan"
[5] Liu Z et al. "Meshdiffusion: Score-based generative 3d mesh modeling"
[6] Poole B et al. "Dreamfusion: Text-to-3d using 2d diffusion"
[7] Zimny D et al. "Points2nerf: Generating neural radiance fields from 3d point cloud"
[8] Spurek P et al. "Hypernetwork approach to generating point clouds."
Q5. It is quite difficult to understand formula (3). Does F_view take as input one point (x, d) or all points on the ray? Does F_sample take as input one point (x, d) or all points on the ray?
We apologize for any inconvenience caused. Due to limited space, we simplified the description of existing generalizable NeRF processes in formula (3), which is consistent with most existing work.
$$\mathcal{F}_{\text{sample}}\left(\left\{\mathcal{F}_{\text{view}}\left( \left\{\mathbf{F}_n\left( \Pi_n(\mathbf{r}(t_i))\right)\right\}_{n=1}^N\right)\right\}_{i=1}^K\right) \;\mapsto\; (\mathbf{c},\sigma)$$
In formula (3), $\mathcal{F}_{\text{view}}$ takes as input features from all reference images $\left\{ \mathbf{I}_{n}, \mathbf{P}_n\right\}_{n=1}^N$ at specific points, where the specific points are obtained by projecting one sampled point on the target ray onto the reference features. Specifically, given a point $\mathbf{r}(t_i)$ on the target ray, formula (3) projects it onto the reference features through the reference and target camera extrinsics $\left\{\mathbf{P}_n\right\}_{n=1}^N$ and $\mathbf{P}_T$, obtaining the relevant positions $\left\{\mathbf{z}_n\right\}_{n=1}^N$ in the reference features. Finally, formula (3) extracts the corresponding features $\left\{\mathbf{f}_n\right\}_{n=1}^N$ from the reference features at $\left\{\mathbf{z}_n\right\}_{n=1}^N$. The above process can be represented as $\left\{\mathbf{F}_n\left( \Pi_n(\mathbf{r}(t_i))\right)\right\}_{n=1}^N$. These features $\left\{\mathbf{f}_n\right\}_{n=1}^N$ from different viewpoints are used as inputs for $\mathcal{F}_{\text{view}}$.
After all points $\left\{ \mathbf{r}(t_i)\right\}_{i=1}^K$ along the target ray have been processed by $\mathcal{F}_{\text{view}}$, $\mathcal{F}_{\text{sample}}$ takes the features obtained for each point on the ray as inputs to obtain the final rendering color. Many existing generalizable NeRFs follow a similar process to formula (3); a similar representation has also been adopted by a substantial number of existing works, such as GNT [4], NeuRay [2] and GeoNeRF [5].
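As a concrete, purely illustrative picture of this projection-and-lookup step, the following sketch uses hypothetical shapes and names (e.g. `project_and_sample`) and a nearest-neighbour lookup in place of whatever interpolation the paper actually uses:

```python
import torch

def project_and_sample(point_w, K, w2c, feat_maps):
    """point_w: [3] world point; K: [N,3,3] intrinsics; w2c: [N,3,4] extrinsics;
    feat_maps: [N,C,H,W] reference feature maps. Returns [N,C] per-view features."""
    N, C, H, W = feat_maps.shape
    homo = torch.cat([point_w, point_w.new_ones(1)])      # [4] homogeneous coordinates
    cam = torch.einsum('nij,j->ni', w2c, homo)            # [N,3] camera-frame coordinates
    pix = torch.einsum('nij,nj->ni', K, cam)              # [N,3] projected, unnormalized
    uv = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)         # [N,2] pixel coordinates
    u = uv[:, 0].round().long().clamp(0, W - 1)           # nearest-neighbour lookup
    v = uv[:, 1].round().long().clamp(0, H - 1)
    return feat_maps[torch.arange(N), :, v, u]            # [N,C] = {f_n}_{n=1..N}
```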
Q6. It is quite difficult to understand formula (4). It looks like F_sample was changed to HyperNet. What is the input to the hypernetwork and what is the input to F_sample? Such a formula should be described more carefully.
$$\mathbf{\Omega}_T = \text{HyperNet} \left( \left\{\mathcal{F}_{\text{view}}\left(\left\{\mathbf{F}_n\left( \Pi_n(\mathbf{r}(t_i)) \right)\right\}_{n=1}^N\right)\right\}_{i=1}^K\right)$$
In formula (4), similar to your understanding, we replace $\mathcal{F}_{\text{sample}}$ with HyperNet. $\mathcal{F}_{\text{sample}}$ predicts the color values of rays by aggregating features from sampled points along the rays, replacing traditional volume rendering, and is typically implemented using Transformers. HyperNet, in contrast, uses the scene features from $\mathcal{F}_{\text{view}}$ to predict the parameters (weights) $\mathbf{\Omega}_T$ of the vanilla NeRF framework. Then, by incorporating the new parameters $\mathbf{\Omega}_T$, we can achieve generalization while following the original NeRF framework and volume rendering. This is the primary distinction between InsertNeRF and existing methods.
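For completeness, here is a minimal sketch of the standard volume-rendering compositing that formula (4) keeps unchanged (a generic NeRF equation written as toy code, not code from the paper):

```python
import torch

def composite(colors: torch.Tensor, sigmas: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """colors: [K,3] per-sample colours, sigmas: [K] densities, deltas: [K] sample spacings."""
    alpha = 1.0 - torch.exp(-sigmas * deltas)                                   # per-sample opacity
    ones = torch.ones(1, dtype=alpha.dtype)
    trans = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-10])[:-1], dim=0)   # transmittance
    weights = alpha * trans                                                     # per-sample contribution
    return (weights.unsqueeze(-1) * colors).sum(dim=0)                          # rendered pixel colour [3]

pixel = composite(torch.rand(64, 3), torch.rand(64), torch.full((64,), 0.02))
```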
[2] Liu Y et al. "Neural rays for occlusion-aware image-based rendering" CVPR 2022
[4] Wang P et al. "Is attention all that NeRF needs?" ICLR 2023
[5] Johari M M et al. "GeoNeRF: Generalizing NeRF with geometry priors" CVPR 2022
Dear Reviewer pJTL,
Once again, thank you for your suggestions and assistance with our work. Regarding your points in Weaknesses-4 and Question-9 about ShapeNet, while it has not been extensively addressed in recent works, we believe discussing this aspect is meaningful [1][2]. We have conducted additional experiments over the past few days:
| | Chairs 1-view | | Chairs 2-view | | Cars 1-view | | Cars 2-view | |
|---|---|---|---|---|---|---|---|---|
| Model | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM |
| ENR | 22.83 | - | - | - | 22.26 | - | - | - |
| SRN | 22.89 | 0.89 | 24.48 | 0.92 | 22.25 | 0.89 | 24.84 | 0.92 |
| ViewFormer[2] (ECCV 2022) | 14.74 | 0.79 | 17.20 | 0.84 | 19.03 | 0.83 | 20.09 | 0.85 |
| pixelNeRF[1] (CVPR 2021) | 23.72 | 0.91 | 26.20 | 0.94 | 23.17 | 0.90 | 25.66 | 0.94 |
| pixelNeRF+HyperNet modules | 24.51 | 0.92 | 26.71 | 0.94 | 24.18 | 0.91 | 26.05 | 0.95 |
We explore the performance of the HyperNet modules on ShapeNet for the chairs and cars scenes. Our work primarily focuses on multi-view inputs under Settings I and II, making validation of the multi-layer dynamic-static aggregation strategy unnecessary with few views. Therefore, we integrate the HyperNet modules into the original pixelNeRF, altering its training inputs accordingly. As shown in the table above, our modules exhibit significant improvements compared to pixelNeRF, especially in the 1-view setting. It is also evident that, compared to NeRF Synthetic, LLFF, and DTU, InsertNeRF shows less improvement on ShapeNet. This might be due to the relatively simplistic appearance and geometry of ShapeNet scenes, and because our work primarily focuses on multi-view settings, as mentioned in [2].
We have added these experiments to Appendix E. All changes are highlighted in BLUE font.
[1] Yu A et al. "pixelNeRF: Neural radiance fields from one or few images"
[2] Kulhánek J et al. "ViewFormer: NeRF-free neural rendering from few images using transformers"
The paper targets the task of NeRF generalization, and proposes the HyperNet module before each MLP layer of vanilla NeRF, inspired by the concept of hypernetwork. The proposed method achieves comparable or even better performance compared to previous works.
Strengths
Such a plug-and-play hypernetwork-based method can bring an extra performance boost for NeRF generalization, which is proved to be useful.
Comparison to previous works and several ablation studies were performed to verify the effectiveness of the aggregation strategy and each component proposed in the HyperNet module.
Weaknesses
The paper may need more explanations or descriptions of the insight behind the method, to demonstrate why such a strategy or technical design could help NeRF generalization; the analysis of "why" is also important.
The experiments are very simple and short; in Figure 3, the methods look close in the visualizations. As a prior and baseline generalizable method, IBRNet shows performance close to the proposed InsertNeRF.
Questions
See weaknesses.
General reply: Thank you for your constructive review and helpful suggestions! We give a detailed response to your questions and comments in the following. If any of our responses do not adequately address your concerns, please let us know and we will get back to you as soon as possible.
C1. Why such a strategy or technical design could help NeRF generalization; the analysis of "why" is also important.
We apologize for any inconvenience caused. We briefly mentioned our motivation in the Introduction and Method sections, but due to space limitations, we did not elaborate on it. Please allow us to analyze and explain why such a strategy and technical design could aid NeRF generalization.
Descriptions: Vanilla NeRF can be considered as an implicit representation that depicts a scene through the parameters of a neural network, i.e. $\pmb{\Theta}$, as described in Section 3.1.
A natural idea is to alter $\pmb{\Theta}$ for different scenes $\pmb{s}$, so that $\pmb{\Theta}_{\pmb{s}}$ possesses the ability to represent each new scene. By sampling different scenes $\pmb{s}$ and generating different $\pmb{\Theta}_{\pmb{s}}$, this can be considered as endowing the vanilla NeRF representation with generalizability across multiple scenes. However, unlike explicit 3D representations such as voxels, meshes, and point clouds, constructing an implicit representation directly for a given scene $\pmb{s}$ is challenging.
Therefore, in this paper, we introduce the hypernetwork, which was invented to generate weights for a target neural network, to address this issue. Through the two types of HyperNet modules we propose, scene-customized weights (parameters) in the NeRF framework are generated for a given scene $\pmb{s}$. Here, we predict these weights by combining feature extraction from reference images with the multi-layer dynamic-static aggregation strategy. Finally, by combining the NeRF framework with the new weights, we obtain $\pmb{\Theta}_{\pmb{s}}$ that can adapt to different scenes $\pmb{s}$, as described in Section 3.2. (A toy illustration of this idea is sketched below.)
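The following sketch is our own illustration (with assumed names such as `TinyHyperNet`, not the paper's architecture): a hypernetwork maps a scene feature to a complete parameter set $\pmb{\Theta}_{\pmb{s}}$ for a fixed NeRF-style MLP, so the same architecture represents different scenes simply by swapping parameters.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

# A fixed NeRF-style MLP architecture: (x, y, z) -> (r, g, b, sigma).
nerf_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 4))

class TinyHyperNet(nn.Module):
    """Maps a scene feature to a full parameter dictionary for a target module."""
    def __init__(self, scene_dim: int, target: nn.Module):
        super().__init__()
        self.shapes = {k: v.shape for k, v in target.named_parameters()}
        # One small head per target parameter tensor (ModuleDict keys cannot contain '.').
        self.heads = nn.ModuleDict({k.replace('.', '_'): nn.Linear(scene_dim, v.numel())
                                    for k, v in target.named_parameters()})

    def forward(self, scene_feat: torch.Tensor) -> dict:
        return {k: self.heads[k.replace('.', '_')](scene_feat).view(shape)
                for k, shape in self.shapes.items()}

hyper = TinyHyperNet(scene_dim=32, target=nerf_mlp)
points = torch.randn(16, 3)                                # 16 sampled 3D points
theta_s = hyper(torch.randn(32))                           # scene-specific parameters Θ_s
rgb_sigma = functional_call(nerf_mlp, theta_s, (points,))  # same architecture, new scene
```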
Next, we discuss why the multi-layer dynamic-static aggregation strategy is meaningful in our network. Extracting reference scene features is crucial for achieving generalization in NeRF tasks. This aggregation strategy is primarily designed to adaptively explore occlusion relationships among points in scene space, facilitating the extraction of accurate scene features. As described in Section 3.2.2, predicting dynamic weights based on the features or scenes is advantageous for handling spatial feature relationships in various complex scenes. (A small sketch of such an aggregation is given below.)
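The sketch below illustrates one possible form of such a dynamic-static aggregation (the names and the exact pooling are our assumptions, not the paper's design): a small MLP predicts per-view weights from the sampled features (the "dynamic" part), combined with fixed, view-order-invariant statistics (the "static" part).

```python
import torch
import torch.nn as nn

class DynamicStaticAggregation(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.weight_mlp = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: [num_points, N_views, feat_dim] features of each sampled point in every reference view
        w = torch.softmax(self.weight_mlp(f), dim=1)            # dynamic, per-view weights
        dynamic = (w * f).sum(dim=1)                            # weighted aggregation over views
        static = torch.cat([f.mean(dim=1), f.var(dim=1)], -1)   # view-order-invariant statistics
        return torch.cat([dynamic, static], dim=-1)             # [num_points, 3 * feat_dim]

agg = DynamicStaticAggregation(feat_dim=32)
scene_feats = agg(torch.randn(1024, 8, 32))                     # 1024 points, 8 reference views
```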
Validation: To demonstrate that InsertNeRF possesses scene-customized parameters, we visualized features from multiple viewpoints in different scenes, as shown in Figures 3b and 8. In these figures, features with different scene parameters cluster in distinct locations, validating the points we raised and indicating that the introduction of the HyperNet modules contributes to the generalizability of the NeRF framework.
Additional: Here, we further discuss why choosing HyperNet is superior to existing works. As mentioned in the Introduction, leveraging HyperNet allows us to impart generalizability to NeRF-based frameworks, which is crucial when discussing the generalization of NeRF-related works and has been overlooked in the existing literature. Moreover, such a structure enables us to replace the popular Transformer architecture, leading to a significant improvement in the inference efficiency of the network, as illustrated in Fig. 4.
To further reflect your comment, we have provided additional explanations for "why such a strategy or technical designs could help NeRF generalization" in Appendix D3. All changes are highlighted in BLUE font.
C2. The experiments are very simple and short, in Figure 3, the methods are close in visualizations. As a prior and baseline work of generalizable work, IBRNet shows close performance to the proposed InsertNeRF.
We apologize for any inconvenience caused. Due to space limitations, we included only a few qualitative comparisons in Figure 1b and Figure 3 in the main text; a more extensive qualitative analysis is supplemented in the Appendix and an additional video. In fact, our visualization experiments cover nearly ALL scenes in the popular generalizable NeRF datasets, such as NeRF Synthetic (Figure 10), LLFF (Figures 3, 9 and 11), and DTU (Figure 11).
Regarding the performance comparison with IBRNet, please allow us to explain this issue from the following three points:
i. Firstly, we believe that the quantitative analysis further illustrates the superiority of our approach over IBRNet, as shown in Table 1. Across the three popular metrics, PSNR and LPIPS exhibit substantial improvements of 1.27 dB and 15.5, respectively, and SSIM also demonstrates an improvement on LLFF, which is significant for the GNeRF task.
ii. In Figure 3, we present qualitative comparison results on the LLFF dataset. The LLFF dataset is derived from real-world scenes, and the rendered objects exhibit complex geometry and texture. Consequently, differences in the visual results on LLFF often manifest in certain details, as seen in [1][2]. For instance, in our Figure 3:
- In the first row, noticeable discrepancies are observed in the leaves of the fern, where IBRNet introduces significant artifacts, and NeuRay and GNT exhibit partial geometric omissions. InsertNeRF effectively mitigates these issues.
- In the second row, IBRNet shows the green chair's 'back', while NeuRay and GNT exhibit blurry geometry on the chair's 'surface'. In contrast, our InsertNeRF demonstrates superior rendering results in both geometry and appearance.
- In the third row, existing methods fail to reconstruct complete geometric information of the leaves, while our approach produces results closest to the ground truth.
- In the fourth row, IBRNet shows holes in some regions, and compared to the ship structure constructed by NeuRay and GNT, our method's result is clearer.
iii. We also analyzed the rendering results of the depth images, which relate to the reconstructed geometry. As shown in Figures 1b and 11, we achieve clearer geometric reconstruction than IBRNet, further demonstrating the superiority of InsertNeRF.
To further reflect your comment, we have reported representative scenes from different datasets and improved Figure 3a. This not only enhances the persuasiveness of our visual results but also further substantiates the advantages of InsertNeRF. In addition, we maximized the utilization of the paper space to provide more visualization results in Fig. 3; due to space limitations, further results are included in the appendix. All changes are highlighted in BLUE font.
We hope our responses have addressed the concerns of the reviewer and we are happy to answer any further questions.
[1] Wang P et al. "Is attention all that NeRF needs?" ICLR 2023
[2] Xu M et al. "WaveNeRF: Wavelet-based generalizable neural radiance fields" ICCV 2023
Dear Reviewer 8PMb,
Thank you for your time and effort in reading our response! We hope our response has addressed your concerns. If you still feel unclear or worried, please let us know; we'd be happy to further clarify and discuss any additional questions. If you feel your concerns have been addressed, please kindly consider if it is possible to update your score.
Thank you!
Paper693 Authors
Dear Reviewer 8PMb,
We appreciate your time and effort in reading our response and revision! If you still have further concerns or feel unclear, please kindly let us know and we are happy to further clarify and discuss. If you feel your concerns have been addressed, we would appreciate it if you might kindly consider updating the score. As the discussion deadline is approaching, we really look forward to your feedback.
Thank you!
Paper693 Authors
Thanks for the responses.
I am satisfied with the answers to my questions, I hope the revisions will be added to the main paper. I have raised my scores.
Dear Reviewer 8PMb
Thank you for your time and effort in reading our response! Due to space limitations, we will strive to incorporate as much of the relevant content as possible into the main paper (Sec. 3.2 and Sec. 4.2), based on your suggestions. If you have any additional concerns, please let us know and we will get back to you as soon as possible!
Thank you!
Paper 693 Authors
We would first like to thank the reviewers for their helpful suggestions and constructive reviews. We are greatly encouraged by their assessment of our work as useful (8PMb, pJTL, km3C), novel (km3C), relevant (km3C), well written and well presented (km3C), clearly demonstrating its contributions (8PMb, km3C), displaying state-of-the-art results (8PMb, pJTL, km3C) and potential in various NeRF-like systems (pJTL, km3C), and solving the important problem of generalizable NeRF with a very good and promising pipeline (km3C). We are grateful that they recognize our work for achieving state-of-the-art performance compared to current models and acknowledge the role of InsertNeRF in the generalization of NeRF tasks. We carefully address each concern raised by the reviewers with detailed explanations and supporting experimental results.
Following the insightful comments given by the reviewers to improve our work, we have made improvements to the revised version of our paper. The writing has been modified according to the reviewers' suggestions, focusing primarily on the presentation (8PMb, pJTL) and discussion of related works (pJTL). We summarize the updates below.
- We have added explanations for "why such a technical design could help NeRF generalization" in Sec. 3.2 and Appendix D3.
- We have updated Figure 3 to provide more visualization results for a variety of scenes.
- We have added experiments to Table 2 comparing against directly introducing a hypernetwork into the NeRF framework, as in Points2NeRF and HyperNeRFGAN.
- We have added related works about generative models and TriPlaneNeRF in Appendix B, and HyperNeRFGAN in Sec. 2.2.
- We have redesigned Figure 2 and updated its description.
- We have added the description of λ1 and λ2 to Appendix A for experimental details.
- We have added the ShapeNet experiments to Appendix E.
We would be grateful if you could review the renewed version of our paper. If there are any comments that we did not adequately address despite the revision, they will be thoroughly reflected in the final version of our paper. Thank you for your helpful and constructive reviews.
Summary
The paper introduces “InsertNeRF,” which enhances the generalizability of NeRF using hypernetworks, which dynamically adjust the NeRF’s weights based on reference scenes. Specifically, the paper “inserts” a HyperNet module before each MLP layer of vanilla NeRF with two types of structures. The experiments show comparable or even better performance compared to existing methods.
Strengths
- The plug-and-play hyper-network design is clean, and the performance boost on generalizable NeRF seems effective based on the experiments.
- The proposed design can potentially be used in various NeRF architectures, showing its generalizability.
Weaknesses
- The insight behind "why the hypernetwork design is important and can help NeRF generalize" is unclear and missing. The authors' explanation did not adequately address why this approach would be fundamentally better than methods like NeuRay, which also utilize scene representations.
- The experiments lack necessary details, especially where the method differs from other existing approaches.
- The presentation of the original version is not easy to follow, with confusing notation in the equations; this was improved in the revision.
- As raised by pJTL, some comparisons with and explanations of prior works on hypernetworks and generalizable NeRF are missing.
Why not a higher score
Although km3C gave this paper a score of 10, the review itself appears somewhat superficial and lacking in depth. Initially, km3C justified the high score by asserting that the paper addresses the generalizable NeRF (Neural Radiance Fields) problem. However, the approach isn't novel; it largely mirrors established methods in generalizable NeRF like pixelNeRF, IBRNet, and NeuRay. The primary innovation of the proposed method lies in utilizing a hypernetwork for conditional generation.
From a broader perspective, the paper doesn't introduce a new problem setting. It merely substitutes the mechanism by which NeRF is conditioned based on reference images. After considering the reviewers’ concerns and the authors’ responses, the necessity of the hypernetwork, especially in comparison to IBRNet or NeuRay, remains ambiguous.
Furthermore, the paper lacks strong motivation and fails to clarify whether, with an equal number of parameters, the proposed HyperNet module significantly outperforms other conditioning methods in terms of generalizability. A more comprehensive ablation study may be required to substantiate its claims.
Why not a lower score
After the rebuttal phase, all reviewers gave positive scores. Two reviewers raised their initial scores and were satisfied with the authors' responses.
Accept (poster)