PaperHub
Overall rating: 5.7/10 (Poster; 3 reviewers; scores 6, 5, 6; min 5, max 6, std 0.5)
Confidence: 3.0 · Correctness: 3.0 · Contribution: 2.7 · Presentation: 3.0
NeurIPS 2024

How to Use Diffusion Priors under Sparse Views?

OpenReview · PDF
Submitted: 2024-04-25 · Updated: 2024-11-06
TL;DR

A score distillation method for sparse views

Abstract

Keywords
Novel View Synthesis, Sparse View, Diffusion Model, 3D Gaussian Splatting

Reviews and Discussion

Review (Rating: 6)

This paper mainly investigates the behavior of SDS in sparse-view 3D reconstruction, pointing out that SDS may unexpectedly harm the 3D reconstruction quality in this case. Compared to SDS in text-to-3D, sparse-view reconstruction requires leveraging visual cues encoded in the input images (named the "inline prior"), while naive SDS fails to do so. To fill this gap, this paper proposes to use a diffusion inpainting model, taking a warped input image as input to guide SDS optimization. This approach is termed Inline Prior Guided Score Matching (IPSM). In addition, a 3DGS pipeline incorporating this prior is introduced in the paper, which achieves SOTA performance.
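To make the mechanism described above concrete, here is a minimal, hypothetical sketch (not the authors' code) contrasting a vanilla SDS residual with an inpainting-guided residual in the spirit of IPSM; `renderer`, `unet`, `inpaint_unet`, `warp_to_view`, and all tensor shapes are assumptions for illustration.

```python
import torch

def guidance_residuals(renderer, unet, inpaint_unet, warp_to_view,
                       pseudo_view, src_img, alphas_bar):
    """Compare vanilla SDS guidance with inline-prior-guided (IPSM-style) guidance."""
    x0 = renderer(pseudo_view)                              # rendered pseudo-view image (1, C, H, W)
    t = torch.randint(20, 980, (1,))                        # random diffusion timestep
    a_bar = alphas_bar[t].view(1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward-diffused rendering

    # Vanilla SDS: only the (text-conditioned) prior decides which mode is targeted.
    eps_prior = unet(x_t, t)

    # IPSM-style guidance: re-project a known input view into the pseudo view; the
    # inpainting prior completes only the unseen region, so the visible region anchors
    # the target mode instead of leaving mode selection to the prior alone.
    warped, mask = warp_to_view(src_img, pseudo_view)       # inline prior + visibility mask
    eps_guided = inpaint_unet(x_t, t, warped * mask, mask)

    # Both residuals act as stop-gradient guidance on the rendered image.
    return (eps_prior - noise).detach(), (eps_guided - noise).detach()
```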

Strengths

  1. The perspective of this paper on the mode-seeking behavior of SDS in the sparse-view setting is interesting.
  2. The results on two public datasets demonstrate the state-of-the-art performance of the proposed method.
  3. Clear and detailed ablation studies are also provided.

Weaknesses

  1. Insufficient analysis of mode deviation. This paper analyzes the "mode deviation" problem of SDS mainly via Fig. 1 (empirical evidence) and Fig. 2 (intuitive explanations). While such evidence is appreciated, a more in-depth analysis, such as a theoretical analysis, would definitely help make this statement more convincing. In Sec. 3.2 I cannot find mathematical evidence as to why SDS cannot work, though it seems the authors wanted to show this.
  2. Novelty of diffusion inpainting model. Leveraging a diffusion inpainting model with warped input images to guide novel views is interesting. However, this idea looks similar to that of [1] in a slightly different context (novel view synthesis w/o SDS). A detailed comparison with [1] in terms of methodology is needed.

References:
[1] Kant et al. iNVS: Repurposing Diffusion Inpainters for Novel View Synthesis. SIGGRAPH Asia, 2023.

Questions

  1. This paper limits the discussion of diffusion prior (used in SDS) to text-to-image models, which may reduce its contributions and make its motivation less persuasive -- we indeed have a better choice to do SDS! Compared to T2I models, a more natural alternative is view-conditioned diffusion models, e.g., Zero-1-to-3 [1], which takes images as input. A recent work, ZeroNVS [2], has shown the success of SDS with such view-conditioned guidance using only a single input image. The second row in Tab. 2 (the first part of IPSM) is essentially an SDS baseline using the diffusion inpainting model, showing improvements for all metrics. This indicates that *an appropriate selection of SDS guidance can avoid the problem stated in the paper*.

  2. Based on 1, additional results as follows would be highly appreciated.
    (1) View-conditioned SDS baseline leveraging Zero-1-to-3 style image-based models, similar to the second row in Tab. 2 of the paper.
    This aims to investigate if view-conditioned diffusion models for SDS also suffer mode deviation.

References:
[1] Liu et al. Zero-1-to-3: Zero-shot One Image to 3D Object. ICCV 2023.
[2] Sargent et al. ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image. CVPR 2024.

Limitations

The authors have discussed their limitations in Sec. 5.

Author Response

Thanks for the careful review. We appreciate the concerns, valuable suggestions, and questions. Here are our corresponding responses.

  • Analysis of the motivation. Building on the detailed discussions of mode-seeking in previous works [A, B], here we provide a theoretical analysis of our motivation. The optimization objective $\min_{\theta} \mathbb{E}_{t, \mathbf{v}_j}[\omega(t) D_{KL}(q_t^{\theta}(\mathbf{x}_t^j) \Vert p_t^*(\mathbf{x}_t^j))]$ drives $q_t^{\theta}(\mathbf{x}_t^j) \sim \mathcal{N}(\mathbf{x}_t^j; \sqrt{\bar{\alpha}_t}\mathbf{x}_0^j, (1 - \bar{\alpha}_t)\mathbf{I})$ towards the high-density region of $p_t^*(\mathbf{x}_t^j)$. Consider two modes $\mathbf{m}_1, \mathbf{m}_2 \in \mathcal{M}(\mathbf{x}_t^j, y)$ of $p_t^*(\mathbf{x}_t^j)$, where $\mathbf{m}_1$ is the target mode and $\mathbf{m}_2$ is a failure mode. $\mathcal{M}$ is the mode range of $p_t^*(\mathbf{x}_t^j)$ determined by $\mathbf{x}_t^j$ and the text prompt $y$; we do not elaborate on the conditions of the diffusion prior distribution for brevity. We denote the distance between the two modes as $D_M = \Vert \mathbf{m}_1 - \mathbf{m}_2 \Vert_2$. We want $\sqrt{\bar{\alpha}_t}\mathbf{x}_0^j \approx \sqrt{\bar{\alpha}_t}\mathbf{m}_1$ for any $t$. When $t$ is small, i.e. $t \rightarrow 0$, we have $\sqrt{\bar{\alpha}_t} \rightarrow 1$, so it is not hard to drive $\sqrt{\bar{\alpha}_t}\mathbf{x}_0^j$ towards $\sqrt{\bar{\alpha}_t}\mathbf{m}_1$. However, when $t$ is large, i.e. $t \rightarrow 1$, we have $\sqrt{\bar{\alpha}_t} \rightarrow 0$, and thus $\sqrt{\bar{\alpha}_t}\mathbf{x}_0^j \rightarrow 0$, $\sqrt{\bar{\alpha}_t}\mathbf{m}_1 \rightarrow 0$, $\sqrt{\bar{\alpha}_t}\mathbf{m}_2 \rightarrow 0$: the two modes become hard to distinguish, so the optimization direction is still affected, resulting in mode deviation. Back to the proposed IPSM: since IPSM introduces the rectified distribution $\tilde{q}_t^{\theta, \phi}(\mathbf{x}_t^j)$ as an intermediate state that narrows the mode range to $\mathcal{M}'(\mathbf{x}_t^j, y, \mathbf{M}^{i \rightarrow j} \odot \mathbf{I}_0^{i \rightarrow j}, \mathbf{M}^{i \rightarrow j})$, the optimization directions are constrained and guided, thus suppressing mode deviation and promoting reconstruction quality.

  • Comparison with iNVS [C]. TL;DR of the core difference: iNVS uses re-projection and Inpainting Stable Diffusion (ISD) to construct a better form of conditioning for the diffusion model, while ours builds an intermediate distribution to guide the SDS optimization of 3D representations. In detail, iNVS aims to improve the conditional encoding of diffusion priors for image-to-3D. Unlike Zero-1-to-3 [D], which directly encodes the input image and the pose of novel views, iNVS uses the characteristics of ISD and re-projection techniques to encode the geometric prior of the input image on novel views into the conditional diffusion model, and fine-tunes ISD on a large-scale external 3D dataset, thereby further improving object consistency in generated novel-view images. Our work aims to lift 3D information from the inline prior to boost SDS in sparse-view reconstruction. Unlike iNVS, this work starts from an analysis of the optimization objective of SDS and builds an intermediate distribution between the diffusion prior and the rendered-image distribution by guiding the sampling trajectory of the pre-trained ISD with inline priors, thereby suppressing mode deviation and promoting improvements in reconstruction.

  • SDS using view-conditioned diffusion prior.

    • An intuitive evaluation of Zero-1-to-3 on objects and scenes. As shown in Fig. 2 of the attachment, we provide the novel-view results of Zero-1-to-3 when given an object image and a scene image. With changing azimuth, Zero-1-to-3 can generate satisfactory results for objects from different new perspectives, but not for scenes, where the output always stays fixed at the input perspective. This is because Zero-1-to-3 is fine-tuned on the large-scale 3D object dataset Objaverse [E]. During the training of Zero-1-to-3, inherent inductive biases about objects are introduced into the model, making it hard to produce satisfactory results from new perspectives on scenes.
    • Quantitative experimental results of SDS using Zero-1-to-3. We conduct experiments of SDS using Zero-1-to-3 on LLFF with 3 input views, shown below. Please note that Zero-1-to-3 controls the camera position via the polar angle, azimuth, and radius, but does not provide a way to control the camera orientation. So we do not add noise to the camera orientation when generating pseudo views with Zero-1-to-3. We can see that adding Zero-1-to-3 as a prior cannot improve the quality of scene reconstruction, which is consistent with the intuitive visualization results mentioned above. 3D diffusion priors lack large amounts of 3D data for learning the distribution of the real 3D world, making it difficult for them to directly benefit sparse-view reconstruction.

Table 1. Quantitative experimental results of SDS.

| Setting | SSIM | LPIPS | PSNR | AVGE |
| --- | --- | --- | --- | --- |
| Base | 0.625 | 0.254 | 19.00 | 0.125 |
| w/ SDS (Zero-1-to-3, CFG=3.0 default) | 0.566 | 0.361 | 17.65 | 0.160 |
| w/ SDS (SD, CFG=7.5) | 0.647 | 0.267 | 18.80 | 0.128 |
| w/ SDS (SD, CFG=100) | 0.576 | 0.367 | 17.53 | 0.162 |
| w/ SDS (ISD, CFG=7.5) | 0.636 | 0.245 | 19.22 | 0.121 |
| w/ IPSM (CFG=7.5) | 0.670 | 0.229 | 19.60 | 0.113 |

References:

[A] DreamFusion: Text-to-3D using 2D Diffusion. ICLR 2023.

[B] Stable score distillation for high-quality 3d generation. arXiv:2312.09305, 2023.

[C] iNVS: Repurposing Diffusion Inpainters for Novel View Synthesis. SIGGRAPH Asia, 2023.

[D] Zero-1-to-3: Zero-shot One Image to 3D Object. ICCV 2023.

[E] Objaverse: A universe of annotated 3d objects. CVPR 2023.

Comment

Thanks for your efforts to prepare the response. However, my concerns remain in the following aspects:

  1. The provided theoretical analysis of the proposed method is unclear to me. For example, why does $\sqrt{\bar{\alpha}_t} \rightarrow 1$ indicate that it is easy to push the distribution of rendered images to the target mode? It's also not clear as to why the "optimization direction will still be affected" when $t \rightarrow 1$?

  2. I apologize if I was unclear regarding the view-conditioned SDS baseline. Indeed, it is not surprising to me that vanilla Zero-123 may not work in the proposed setup, where it only learned object-centric novel view priors. I mentioned ZeroNVS [1] in my review, which indeed used a fine-tuned version of Zero-123 to learn scene-level priors. Given the clear distribution gap between the object-centric and scene-level data, experiments on this modified Zero-123 are desired.

  3. The authors did not answer my first question. Even in the ablation study (row 2, Table 2 in the main paper), it turns out that using SDS in a proper way (i.e., with the proposed diffusion inpainting model) does help improve the NVS metrics. This is not compatible with the authors' current claim that SDS cannot work in sparse views. I would suggest that the authors clearly state the scope of the discussion for SDS. Explicitly mentioning the authors' focus on text-to-image models may help resolve this issue, though their contributions would be reduced.

While the proposed approach shows promising improvements over baselines, my primary concern is still the authors' overclaim about the scope of the problem. Given these reasons, I may not change my original score.

[1] Sargent et al. ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image. CVPR 2024.

Comment

Thanks for your careful review. Please note that this part of the response is continuous with the response above but was split due to the character limit.

  • Clarifying the discussion scope for SDS. Thanks for the valuable suggestion about the scope of the discussion of SDS in this paper. We will clarify the scope of the SDS analysis in the final version of the paper, i.e. the SDS experimental results discussed in this paper mainly use text-to-image models. Furthermore, thanks to your concerns and additional suggestions on inpainting and 3D diffusion priors, we will also add discussions about using inpainting diffusion and 3D diffusion priors, as follows.
    • Discussion about using 3D diffusion priors. We agree that current 3D diffusion priors already have a certain ability to represent the 3D world. However, as reported in Tab. 1 and 2, 3D diffusion priors still cannot provide a significant boost on different 3D scene datasets, since the 3D annotation data used to fine-tune them remains scarce. Specifically, ZeroNVS [A] is fine-tuned on a mixture dataset consisting of CO3D [B] (50 MS-COCO categories, 1.5 million camera-annotated frames), ACID [C] (10 categories of construction machines, 15767 construction machine objects), and RealEstate10K [D] (10 million frames with camera poses). In contrast, Stable Diffusion [E] and its inpainting version are trained on LAION-5B [F] (5.85 billion CLIP-filtered image-text pairs). From the additional experiments we conducted, we notice that these 3D priors can only provide diffusion guidance with limited visual fidelity, especially for scenes. We do not mean to disparage 3D diffusion priors; on the contrary, we expect 3D priors to provide effective guidance for 3D scene optimization. However, it remains an objective fact that 3D training data for 3D diffusion models is scarce. How to efficiently construct high-fidelity 3D data, or how to use knowledge from 2D data to complement the training of 3D diffusion priors, remains a core challenge in this field.
    • Discussion about using inpainting diffusion priors guided by inline priors. As shown in Tab. 2 of the main manuscript, SDS using the inpainting diffusion prior shows improvements in the NVS metrics. This is because we make reasonable use of the 3D inline prior contained in the sparse views, which guides the sampling of the inpainting diffusion model and thereby improves reconstruction quality. Since Stable Diffusion [E] and ZeroNVS [A] cannot reasonably incorporate the inline prior obtained by re-projection to guide sampling, they cannot provide effective diffusion guidance for optimization. However, it is difficult to use this approach (re-projection techniques plus inpainting diffusion priors) in the original single-view reconstruction setting of ZeroNVS [A], since, given a single image, we have no 3D information about the scene at all unless external monocular depth estimation is used. For the single-image situation, we have to use diffusion models like ZeroNVS that are pre-trained with external 3D data to provide direct 3D guidance. Back to the sparse-view setting, we aim to utilize the information contained in sparse views, which differs from a text prompt or a single image (the setting of most score distillation methods), to make up for the shortcomings of score distillation in this field, so we propose IPSM in this paper. (A rough sketch of the re-projection step described here is given below.)
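As a rough illustration of the re-projection step that produces the inline prior, the following sketch forward-warps a source view into a target view using its depth and returns the coverage mask that would be handed to an inpainting diffusion model. The intrinsics `K`, the relative pose `(R_rel, t_rel)`, nearest-neighbour splatting, and the absence of z-buffering are simplifying assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def reproject(src_img, src_depth, K, R_rel, t_rel):
    """Forward-warp src_img (H, W, 3) with per-pixel depth into a target camera."""
    H, W, _ = src_img.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project source pixels to 3D and move them into the target camera frame.
    cam_pts = (np.linalg.inv(K) @ pix.T) * src_depth.reshape(1, -1)
    tgt_pts = R_rel @ cam_pts + t_rel.reshape(3, 1)

    # Project into the target view (nearest-neighbour splatting, no z-buffer).
    proj = K @ tgt_pts
    uv = np.round(proj[:2] / np.clip(proj[2], 1e-6, None)).astype(int)

    warped = np.zeros_like(src_img)
    mask = np.zeros((H, W), dtype=bool)
    valid = (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H) & (proj[2] > 0)
    warped[uv[1, valid], uv[0, valid]] = src_img.reshape(-1, 3)[valid]
    mask[uv[1, valid], uv[0, valid]] = True   # True = covered by the warped inline prior
    return warped, mask                        # the uncovered region is what gets inpainted
```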

[A] ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image. CVPR 2024.

[B] Infinite nature: Perpetual view generation of natural scenes from a single image. ICCV 2021.

[C] Reconfusion: 3d reconstruction with diffusion priors. CVPR 2024.

[D] Stereo magnification: Learning view synthesis using multiplane images. SIGGRAPH 2018.

[E] High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022.

[F] LAION-5B: An open large-scale dataset for training next generation image-text models. NeurIPS 2022.

Comment

I once again appreciate the authors' detailed response, and I would like to raise my score to 6.

The additional experiments addressed most of my concerns. They demonstrate that SDS leveraging readily available view-conditioned diffusion models (e.g., scene-finetuned Zero123) does not work as well as the proposed approach. This makes the statements in the paper more convincing.

Finally, I would like to have a few comments that I hope are useful to improve the quality of the work:

  1. The authors may clearly indicate the scope of discussion for SDS and may also incorporate view-conditioned baselines (e.g., scene-level Zero123) into experiments and discussions. The performance of the proposed method on text-to-image models is convincing to me.
  2. I agree with the authors that one advantage of the proposed approach is to leverage the strong natural-image priors from Stable Diffusion. However, this also makes the comparison with view-conditioned SDS a bit "unfair" -- what if the view-conditioned model were fine-tuned on scene images at that scale? As a fairer comparison is not practically possible, I suggest the authors make careful claims regarding the effectiveness of (view-conditioned) SDS. Specifically, I may attribute the unexpected performance of view-conditioned SDS more to the domain gap (the fine-tuning data for Zero123 vs. the datasets used for evaluation) than to the SDS approach itself.
  3. Thank you for clarifying the theoretical motivation. It makes sense to me that different values of t make the target and failure modes either easily separable or indistinguishable. However, this seems not quite specific to the sparse-view setup but rather a common property of SDS. The authors may consider revising the writing accordingly to clarify the motivation.

Comment

Thanks for the careful review and the recognition of our work. We appreciate the valuable suggestions that make the paper stronger.

Following your comments, we will improve this paper in the final version in three aspects. (1) Clarifying the discussion scope of SDS. We will clarify that the experimental results of SDS in the main manuscript are mainly obtained with text-to-image priors, and we will further add a discussion of inpainting priors and 3D priors such as Zero-1-to-3 and ZeroNVS with additional experimental results. (2) Careful claims regarding SDS with view-conditioned diffusion priors. We agree that the unsatisfactory performance of SDS using view-conditioned 3D priors should not be mainly attributed to SDS itself, but rather to the domain gap of the 3D priors. We will state this clearly in the supplemented discussion of (1). (3) Theoretical analysis of the motivation. In addition to the experimental evidence and intuitive visualizations, we will expand the analysis of SDS in Sec. 3.2 and provide more theoretical analysis of the motivation to further highlight our core contribution.

Thank you again for your efforts and suggestions throughout the review process.

Comment

Thanks for your patience while reviewing our response to the questions and concerns. We provide the following supplementary responses to the relevant questions.

  • Theoretical analysis. We apologize for not elaborating on the theoretical analysis due to the character and time limits of the rebuttal. We polish and supplement it here, starting from the requirement that $\sqrt{\bar{\alpha}_t}\mathbf{x}_0^j \approx \sqrt{\bar{\alpha}_t}\mathbf{m}_1$ for any $t$. Keep in mind that the distance between the two modes is $D_M = \Vert \mathbf{m}_1 - \mathbf{m}_2 \Vert_2$. Trivially, $D_M \in \mathbb{R}^+$ and it is bounded by constants $C_1$ and $C_2$ with $C_1 < D_M < C_2$, where $C_1 < C_2 \in \mathbb{R}^+$. When $t$ is small, for simplicity we consider $t \rightarrow 0$ and $\sqrt{\bar{\alpha}_t} \rightarrow 1$; then there is a significant gap between the two modes during optimization, $\Vert \sqrt{\bar{\alpha}_t}\mathbf{m}_1 - \sqrt{\bar{\alpha}_t}\mathbf{m}_2 \Vert_2 = \sqrt{\bar{\alpha}_t} D_M \approx D_M > C_1$. Thus we claim that "it is easy to push the distribution of rendered images to the target mode". When $t$ is large, for simplicity we consider $t \rightarrow 1$ and $\sqrt{\bar{\alpha}_t} \rightarrow 0$; since $D_M < C_2$, the gap between the two modes becomes unclear, i.e. $\Vert \sqrt{\bar{\alpha}_t}\mathbf{m}_1 - \sqrt{\bar{\alpha}_t}\mathbf{m}_2 \Vert_2 = \sqrt{\bar{\alpha}_t} D_M \rightarrow 0$, which results in mode aliasing and further affects the optimization direction during training. Therefore, the distribution of rendered images is not constrained to seek the target mode, causing mode deviation. (A small numeric illustration of this scaling is given after Tab. 2 below.)
  • Experiments using ZeroNVS. The experimental results using ZeroNVS [A] are provided as follows.
    • Additional ablation experiments. As shown in Tab. 1, using ZeroNVS [A] is superior to using Zero-1-to-3 [B], since the former utilizes 3D-annotated scene data for fine-tuning. However, although ZeroNVS [A] as a 3D prior can achieve stunning results in single-view reconstruction by inferring 3D structure from an unlabeled 2D image, it still cannot boost sparse-view reconstruction quality as IPSM does, since the ZeroNVS guidance does not exploit the inline priors of sparse views, unlike in the single-view setting. As shown in the demo of ZeroNVS [A], it exhibits limited visual fidelity, which can be attributed to inductive biases inherited from the limited training data. We also generate the predicted final images of ZeroNVS on sampled pseudo views (however, it seems that we cannot upload more images during the discussion period), and we find that the images generated by ZeroNVS for novel views on the LLFF dataset are blurred and over-saturated, which is consistent with the quantitative results in Tab. 1.
    • Additional quantitative comparison under the 3-view setting. Additional experimental results of ZeroNVS reported in ReconFusion [C] are shown in Tab. 2. The reported experiments modify the original random-view sampling to use the input view closest to the sampled pseudo views, adapting it to the 3-view setting. It should be noted that ZeroNVS * in Tab. 2 is based on the full reconstruction pipeline, which uses not only the SDS loss but also other losses, e.g. a normal-smoothness loss, while ours (Base + IPSM) in Tab. 2 only uses the IPSM loss.

Table 1. Quantitative experimental results of different SDS on the LLFF dataset with 3-view setting.

| Setting | SSIM | LPIPS | PSNR | AVGE |
| --- | --- | --- | --- | --- |
| Base | 0.625 | 0.254 | 19.00 | 0.125 |
| w/ SDS (Zero-1-to-3, CFG=3.0 default) | 0.566 | 0.361 | 17.65 | 0.160 |
| w/ SDS (ZeroNVS, CFG=7.5 default) | 0.639 | 0.289 | 19.12 | 0.129 |
| w/ SDS (SD, CFG=7.5) | 0.647 | 0.267 | 18.80 | 0.128 |
| w/ SDS (SD, CFG=100) | 0.576 | 0.367 | 17.53 | 0.162 |
| w/ SDS (ISD, CFG=7.5) | 0.636 | 0.245 | 19.22 | 0.121 |
| w/ IPSM (CFG=7.5) | 0.670 | 0.229 | 19.60 | 0.113 |

Table 2. Quantitative experimental results of ZeroNVS on the LLFF dataset with 3-view setting.

| Setting | SSIM | LPIPS | PSNR | AVGE |
| --- | --- | --- | --- | --- |
| ZeroNVS * | 0.359 | 0.512 | 15.91 | 0.219 |
| Base + SDS (ZeroNVS, CFG=7.5 default) | 0.639 | 0.289 | 19.12 | 0.129 |
| Base + IPSM (CFG=7.5) | 0.670 | 0.229 | 19.60 | 0.113 |

*: The reported experimental results of ZeroNVS are taken from ReconFusion. The vanilla ZeroNVS is based on NeRF and also utilizes other losses besides SDS.
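As a supplementary note to the theoretical analysis above, the following small numeric sketch illustrates how $\sqrt{\bar{\alpha}_t}$, and hence the scaled mode gap $\sqrt{\bar{\alpha}_t} D_M$, collapses at large $t$. The linear beta schedule is an assumption chosen for illustration and is not necessarily the schedule used in the paper.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # common DDPM-style linear schedule (assumed)
alphas_bar = np.cumprod(1.0 - betas)      # cumulative product \bar{alpha}_t
D_M = 1.0                                 # nominal distance between the two modes m_1, m_2

for t in [10, 100, 500, 900, 990]:
    gap = np.sqrt(alphas_bar[t]) * D_M    # scaled separation of the two modes at step t
    print(f"t={t:4d}  sqrt(alpha_bar)={np.sqrt(alphas_bar[t]):.4f}  scaled mode gap={gap:.4f}")

# Small t: the gap stays close to D_M, so the target mode is easy to single out.
# Large t: the gap collapses towards 0, the regime in which mode aliasing is argued to occur.
```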

[A] ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image. CVPR 2024.

[B] Zero-1-to-3: Zero-shot One Image to 3D Object. ICCV 2023.

[C] Reconfusion: 3d reconstruction with diffusion priors. CVPR 2024.

Review (Rating: 5)

This paper introduces a novel approach for synthesizing novel views from sparse view inputs using diffusion priors. The authors conduct a thorough analysis of SDS optimization under sparse views and propose an inline prior guided score matching algorithm to rectify the distribution of rendered images. 3DGS is chosen as the 3D representation and rendering method, and IPSM is incorporated for generating novel view images. The experimental results demonstrate that the proposed method significantly improves novel view synthesis from sparse views on several commonly used benchmarks.

Strengths

  1. The authors conduct a thorough analysis of SDS algorithms conditioned on sparse view images.
  2. The proposed IPSM method is both novel and reasonable.
  3. The paper is generally well-written, with clear and concise explanations.

Weaknesses

  1. The paper solely compares the quantitative results of the proposed method with SDS, without providing qualitative results for SDS.
  2. The experimental settings lack detailed information. For instance, the level of sparsity in the input views, the size of baselines in the input views, and the size of baselines between input and output novel views are not clearly specified.
  3. The paper fails to clarify the necessity of using 3DGS.

Questions

  1. The authors should address the questions raised in the Weaknesses section to provide a more comprehensive understanding of their work.
  2. It would be beneficial for the authors to discuss how the proposed method performs in extreme circumstances, such as when the two input views are opposite.
  3. It is important to clarify whether the proposed method can handle extrapolation scenarios. For instance, how does it perform when the azimuths of the two input views are 0 degrees and 90 degrees, and the azimuths of the new views are 180 degrees?
  4. It would be interesting to explore whether 3DGS can be replaced with NeRF and how the replacement influences the performance.

Limitations

Yes, the authors discuss the limitation.

Author Response

Thanks for your efforts and patience in carefully reviewing this paper. We appreciate the suggestions and questions. Here we provide detailed responses.

  • Qualitative results for SDS. As shown in Fig. 1 (a) of the attachment, the guidance of SDS produces imaginary reconstructions caused by mode deviation when the diffusion prior is used directly, as we demonstrated in the main manuscript. This behavior is reasonable and acceptable in text-to-3D generation tasks, but it fails for the reconstruction of specific scenes constrained by sparse views. As shown in Fig. 1 (b) of the attachment, we can also observe that SDS produces large floaters during optimization, which indicates the training instability of SDS, since SDS overlooks the inline prior of the sparse views and can hardly provide stable guidance towards the target mode. We provide more qualitative comparisons with SDS in the video of the supplementary material.
  • Detailed experimental settings. We provide the detailed experimental settings below. It should be noted that all per-scene-optimized baselines share an identical experimental protocol.
    • The LLFF dataset includes 8 scenes in total. For each scene, following RegNeRF [A], we choose every 8th image as the test set and use 3 views uniformly sampled from the remaining images (a code sketch of this split protocol is given after this list). The original resolution of the LLFF dataset is $4032 \times 3024$. Following DNGaussian [B], we downsample the images by $8\times$ for both training and testing. In Tab. 1 of the attachment, we report the level of sparsity.
    • The DTU dataset includes 124 scenes in total. Prevailing pre-training methods such as PixelNeRF [C] use 88 scenes for training and 15 test scenes, i.e. IDs [8, 21, 30, 31, 34, 38, 40, 41, 45, 55, 63, 82, 103, 110, 114], for per-scene fine-tuning and testing. Following the prevailing per-scene optimization works RegNeRF [A] and DNGaussian [B], we directly optimize our model per scene on the 15 test scenes. For each scene of the DTU dataset, following RegNeRF [A], the IDs of the 3 input training views are [25, 22, 28], and the IDs of the test views are [1, 2, 9, 10, 11, 12, 14, 15, 23, 24, 26, 27, 29, 30, 31, 32, 33, 34, 35, 41, 42, 43, 45, 46, 47]. The original resolution of the DTU dataset is $1600 \times 1200$. Following RegNeRF [A], we downsample the images by $4\times$ for both training and testing.
  • Extreme circumstances. For the sparse-view scene reconstruction task, researchers focus on using inline priors and external priors to suppress the overfitting issue; therefore, both methods perform poorly in the mentioned extreme circumstances. We construct corresponding data and conduct experiments against the state-of-the-art method DNGaussian [B].
    • Two opposite input views. We select 2 opposite views for each scene of the MipNeRF-360 dataset, i.e. the training-view IDs of each scene are: [2, 26] for bicycle; [22, 151] for bonsai; [57, 185] for counter; [1, 57] for garden; [14, 171] for kitchen; [2, 79] for room; [26, 34] for stump. The test views are selected as every 8th image, following Mip-NeRF. The quantitative comparisons with the state-of-the-art methods DNGaussian and FSGS are shown in Tab. 2 of the attachment. We report the PSNR, SSIM, LPIPS, and AVGE for each scene and the average over all scenes. Our method outperforms the state-of-the-art method DNGaussian on every scene and achieves improvements of 22.03% and 19.21% on the average PSNR and AVGE scores respectively.
    • Extrapolation scenarios. We select 2 views at 0 and 90 degrees for each scene of the MipNeRF-360 dataset, i.e. IDs: [2, 14] for bicycle; [22, 248] for bonsai; [57, 145] for counter; [1, 15] for garden; [14, 37] for kitchen; [2, 291] for room; [26, 28] for stump. The test views are selected at 180 degrees, i.e. IDs: [26] for bicycle; [151] for bonsai; [185] for counter; [57] for garden; [171] for kitchen; [79] for room; [34] for stump. The quantitative results, analogous to Tab. 2, are shown in Tab. 3 of the attachment. Our method outperforms the state-of-the-art method DNGaussian on every scene and achieves improvements of 25.78% and 21.34% on the average PSNR and AVGE scores respectively.
  • 3DGS as the backbone.
    • Reason for choosing 3DGS. NeRF, as an implicit 3D representation, offers photo-realistic rendering quality but suffers from slow training and rendering. 3DGS, as a newer explicit 3D representation, offers fast training and rendering and provides researchers with a convenient 3D representation framework, which is why we chose 3DGS as the backbone.
    • How about NeRF? In text-to-3D and image-to-3D tasks, score distillation methods are widely used with different backbones: representative methods using NeRF as the backbone include DreamFusion [D], and representative methods using 3DGS as the backbone include DreamGaussian [E]. Since this paper proposes a score distillation method, it can in principle be migrated to NeRF, just as VSD [H] can be applied to both NeRF and 3DGS.
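For concreteness, the LLFF split protocol stated above can be sketched as the hypothetical helper below (not the authors' code); the uniform-sampling rule follows the description in this response and the exact index conventions may differ.

```python
import numpy as np

def llff_split(n_images: int, n_train: int = 3, hold_every: int = 8):
    """Hold out every `hold_every`-th image for testing; sample training views uniformly."""
    all_ids = np.arange(n_images)
    test_ids = all_ids[::hold_every]                          # every 8th image -> test set
    remaining = np.setdiff1d(all_ids, test_ids)
    pick = np.linspace(0, len(remaining) - 1, n_train).round().astype(int)
    train_ids = remaining[pick]                               # uniformly sampled training views
    return train_ids, test_ids

train_ids, test_ids = llff_split(n_images=40, n_train=3)      # e.g. a scene with 40 captured images
print(train_ids, test_ids)
```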

[A] M. Niemeyer, J. T. Barron, B. Mildenhall, M. S. Sajjadi, A. Geiger, and N. Radwan, “Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs,” CVPR 2022.

[B] J. Li, J. Zhang, X. Bai, J. Zheng, X. Ning, J. Zhou, and L. Gu, “Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization,” CVPR 2024.

[C] A. Yu, V. Ye, M. Tancik, and A. Kanazawa, “pixelnerf: Neural radiance fields from one or few images,” CVPR 2021.

[D] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “Dreamfusion: Text-to-3d using 2d diffusion,” ICLR 2023.

[E] J. Tang, J. Ren, H. Zhou, Z. Liu, and G. Zeng, “Dreamgaussian: Generative gaussian splatting for efficient 3d content creation,” ICLR 2024.

Review (Rating: 6)

This paper deals with the problem of novel view synthesis from a sparse set of input views. While this problem has been tackled with depth or semantic regularization in the past, the authors approach the problem by introducing priors from a pre-trained diffusion model following a few recent works like ReconFusion. However, these recent works require fine-tuning a diffusion model on multi-view data and do not directly apply pre-trained diffusion models as is common practice in text-to-3D, e.g. via an SDS objective. The authors observe that this is because when using SDS in a straightforward manner, the performance of the model decreases under standard CFG settings. Given their observation, they propose IPSM which decomposes SDS into two sub-objectives. The reasoning is that we can guide the SDS process by additionally supervising the predicted noise of the target view to be close to the noise predicted from an inpainted version of the target view obtained through re-projection and infilling. In addition to the IPSM objective, the authors supervise the model with a depth loss and an image-based rendering loss (using the same process as the re-projection in IPSM).

Strengths

  • The paper provides a simpler alternative to ReconFusion, i.e. the geometric consistency is promoted through a simple reprojection-based guidance instead of a PixelNeRF that needs to be jointly trained with the diffusion model on external data.
  • The paper writing is good, i.e. clear motivation, the related work focuses on dissecting the difference of existing work to this one, and clear method section.
  • The experimental results show convincing improvements over the 3DGS baseline, naive SDS, and other few-shot methods such as FSGS.

Weaknesses

  • Ablation in Tab. 2: It is great to see that using both objectives in IPSM is superior, however, it would be great to see a version without IPSM and with $\mathcal{L}^{\text{geo}}$ and $\mathcal{L}^{\text{depth}}$ to judge if IPSM is needed or if the geometric regularization alone is sufficient.
  • The experimental setup focuses on NVS from 3 input views. However, the method should be suitable for arbitrary sparse view setups. It would be interesting to investigate if the improvements hold with 6/9 views (ReconFusion setup) or even more. This would broaden the application scenarios of the method. It would be very interesting to see if this simpler approach can rival the performance of the more complex ReconFusion pipeline given their data is released.
  • Fig. 3 is broken in the paper. Fortunately, the supplementary video also shows the figure.

Questions

I hope the authors can mainly answer the first question I listed in weaknesses, but I think also addressing the second concern could make the paper stronger.

Limitations

The limitations were sufficiently discussed in the main paper.

Author Response

Thanks for your efforts and patience in reviewing this paper. We appreciate the positive comments, valuable concerns, and suggestions on our work. Here are our responses to the mentioned weaknesses and questions.

  • Additional ablation study. To supplement more complete experimental results, we provide an additional ablation study using 3 views on the LLFF and DTU datasets in Tab. 1 and 2 respectively. We can see that $\mathcal{L}_{\rm depth}$ presents a strong prior for optimization since it directly provides 3D geometric guidance on the 3D representation. Notably, although both $\mathcal{L}_{\rm geo}$ and $\mathcal{L}_{\rm IPSM}$ use re-projection techniques to introduce the 2D visual prior information of the sparse views to promote optimization, $\mathcal{L}_{\rm IPSM}$ achieves satisfactory performance comparable to the direct 3D guidance of $\mathcal{L}_{\rm depth}$, as shown in Tab. 2. At the same time, it is difficult for $\mathcal{L}_{\rm geo}$ to promote optimization on its own without the assistance of other regularizations.

Table 1: Additional Ablation Study on the LLFF dataset with 3-view setting

| Setting | SSIM | LPIPS | PSNR | AVGE |
| --- | --- | --- | --- | --- |
| Base | 0.625 | 0.254 | 19.00 | 0.125 |
| Base + $\mathcal{L}_{\rm depth}$ | 0.687 | 0.212 | 20.08 | 0.105 |
| Base + $\mathcal{L}_{\rm geo}$ | 0.651 | 0.235 | 19.35 | 0.117 |
| Base + $\mathcal{L}_{\rm IPSM}$ | 0.670 | 0.229 | 19.60 | 0.113 |
| Base + $\mathcal{L}_{\rm IPSM}$ + $\mathcal{L}_{\rm depth}$ + $\mathcal{L}_{\rm geo}$ | 0.702 | 0.207 | 20.44 | 0.101 |

Table 2: Additional Ablation Study on the DTU dataset with 3-view setting.

| Setting | SSIM | LPIPS | PSNR | AVGE |
| --- | --- | --- | --- | --- |
| Base | 0.836 | 0.134 | 19.11 | 0.087 |
| Base + $\mathcal{L}_{\rm depth}$ | 0.849 | 0.122 | 19.77 | 0.079 |
| Base + $\mathcal{L}_{\rm geo}$ | 0.835 | 0.135 | 19.28 | 0.086 |
| Base + $\mathcal{L}_{\rm IPSM}$ | 0.853 | 0.122 | 19.67 | 0.080 |
| Base + $\mathcal{L}_{\rm IPSM}$ + $\mathcal{L}_{\rm depth}$ + $\mathcal{L}_{\rm geo}$ | 0.856 | 0.121 | 19.99 | 0.077 |

  • Additional experimental results with more input views. Thanks for the concern about the performance of our method with more input views. Experiments with more input views can further probe the robustness of our method under sparse views. We provide additional experimental results with 6 and 9 input views on the LLFF dataset in Tab. 3 and 4 respectively. Note that * denotes results reported in ReconFusion and # denotes results reported in DNGaussian. Notably, our method uses exactly the same parameters as in the 3-view LLFF setting for training.
    • 6 input views. As shown in Tab. 3 below, we achieve an improvement of 11.18% on LPIPS compared to ReconFusion. It should be noted that ReconFusion requires additional computational resources for pre-training an encoder with external data, as we demonstrated in the main manuscript. Excluding methods that require additional resources for pre-training, our method achieves improvements of 8.12%, 8.34%, 31.82%, and 30.68% on PSNR, SSIM, LPIPS, and AVGE respectively, compared to DNGaussian, the state-of-the-art method based on 3DGS.
    • 9 input views. Similar to the results with 6 input views, our method still outperforms all state-of-the-art methods on SSIM, LPIPS, and AVGE scores and achieves comparable results on PSNR. As shown in Tab. 2 of the attachment, compared to the 3DGS-based DNGaussian, we achieve improvements of 7.94%, 8.38%, 26.11%, and 33.77% on PSNR, SSIM, LPIPS, and AVGE respectively.

Table 3: Quantitative comparisons with 6 input views on the LLFF dataset

| Method | Pub. | Pretrain | PSNR | SSIM | LPIPS | AVGE |
| --- | --- | --- | --- | --- | --- | --- |
| Zip-NeRF * | ICCV 2023 | - | 20.71 | 0.764 | 0.221 | 0.097 |
| RegNeRF * | CVPR 2022 | - | 23.09 | 0.760 | 0.243 | 0.084 |
| DiffusioNeRF * | CVPR 2023 | ✓ | 23.60 | 0.775 | 0.235 | 0.079 |
| FreeNeRF * | CVPR 2023 | - | 23.72 | 0.773 | 0.232 | 0.078 |
| SimpleNeRF * | SIGGRAPH Asia 2023 | - | 23.05 | 0.737 | 0.296 | 0.091 |
| ReconFusion * | CVPR 2024 | ✓ | 24.25 | 0.815 | 0.152 | 0.063 |
| 3DGS # | SIGGRAPH 2023 | - | 20.63 | 0.699 | 0.226 | 0.108 |
| DNGaussian # | CVPR 2024 | - | 22.18 | 0.755 | 0.198 | 0.088 |
| Ours | - | - | 23.98 | 0.818 | 0.135 | 0.061 |

Table 4: Quantitative comparisons with 9 input views on the LLFF dataset

| Method | Pub. | Pretrain | PSNR | SSIM | LPIPS | AVGE |
| --- | --- | --- | --- | --- | --- | --- |
| Zip-NeRF * | ICCV 2023 | - | 23.63 | 0.830 | 0.166 | 0.067 |
| RegNeRF * | CVPR 2022 | - | 24.84 | 0.820 | 0.196 | 0.065 |
| DiffusioNeRF * | CVPR 2023 | ✓ | 24.62 | 0.807 | 0.216 | 0.069 |
| FreeNeRF * | CVPR 2023 | - | 25.12 | 0.820 | 0.193 | 0.063 |
| SimpleNeRF * | SIGGRAPH Asia 2023 | - | 23.98 | 0.762 | 0.286 | 0.082 |
| ReconFusion * | CVPR 2024 | ✓ | 25.21 | 0.848 | 0.134 | 0.054 |
| 3DGS # | SIGGRAPH 2023 | - | 20.44 | 0.697 | 0.230 | 0.108 |
| DNGaussian # | CVPR 2024 | - | 23.17 | 0.788 | 0.180 | 0.077 |
| Ours | - | - | 25.01 | 0.854 | 0.133 | 0.051 |

  • Broken figure of the pipeline. We are sorry that Fig. 3 in the main manuscript may appear broken in some PDF viewers; it can be viewed normally in Google Chrome. We will replace it with an image file.

Author Response

We thank all ACs and reviewers for their efforts in reviewing and for their valuable comments and suggestions on this paper. We have addressed the reviewers' comments and questions in individual responses to each reviewer and provided supplementary figures and tables in the one-page PDF attachment.

Final Decision

Pre-rebuttal, the reviewers liked the general framework of this work for its simplicity and clear motivation, highlighting the writing quality, method, and results. Some concerns were the diversity of results and the lack of theoretical justification of the mode-seeking behavior. The paper received positively-leaning borderline scores. The authors prepared a strong rebuttal in my and the reviewers' opinion, addressing the mentioned concerns, which led to convergence on accept scores. I agree with this assessment and therefore recommend accepting this paper.

I encourage the authors to add the additional results to the paper, as they clearly strengthen the work.