3DGS-Enhancer: Enhancing Unbounded 3D Gaussian Splatting with View-consistent 2D Diffusion Priors
We propose a method that exploits view-consistent 2D generative priors, i.e., a video diffusion model, to enhance 3D Gaussian splatting rendering quality.
Abstract
Reviews and Discussion
The paper tackles the problem of enhancing 3DGS novel views that are far from existing viewpoints (encountered in sparse-view settings), where under-sampled areas lack sufficient information. This work enhances the low-quality views using a video diffusion model to maintain multi-view and temporal consistency. The authors reformulate the task of 3D-consistent image restoration as video restoration, which is in turn reformulated as a video interpolation task between frames. The high-quality frames along with the restored images are then used to fine-tune the 3DGS. The authors also propose a modified spatial-temporal decoder (STD) to reduce artifacts and introduce confidence awareness at both the image level (based on the distance of the novel view from a reference view) and the pixel level (information gain) to minimize the negative impact of generated images.
Strengths
- The paper is well motivated: enhancing low-quality 3DGS rendering results for camera viewpoints that are far away from the input views
- Reformulates the 3D-consistent image restoration task in 3DGS as temporally consistent video generation to leverage video diffusion models
- The results look quite impressive and have been tested on a variety of scenes
- Can be applied to any existing 3DGS model and to more general scenes
- Introduces a dataset for evaluating artifacts in 3DGS (will this dataset be open-sourced?)
- Sound experimental setup and adequate comparisons with other methods
Weaknesses
- Related work appears to be thin, and some of its paragraphs could be more descriptive
- The method relies on adjacent views for continuous interpolation
- The choice of STD should be quantitatively ablated. Individual parts of the STD, such as color correction, should be ablated as well.
Questions
- Authors describe in L184-186 "Despite this significant enhancement in the quality of the rendered views, we propose to rely more on the reference views rather than the restored novel views when fine-tuning the 3DGS model, since the 3DGS model is highly sensitive to slight inaccuracies in the restored views." I would appreciate a more thorough discussion on this.
- "modified baselines denoted with ∗ taken from [36]". However, no baseline seems to be marked with ∗ in Table 1. This is possibly an oversight and the authors should correct it.
- What's the adversarial loss described in L179 for the STD?
- On what data is the Temporal Denoising U-Net fine-tuned?
Suggestion: Figure 2 would be more interpretable if the right part showed the output of the enhanced views (with input from the left part); currently there is no 1:1 parity.
Limitations
Limitations are adequately addressed
S1. Will this dataset be open sourced?
Our 3DGS Enhancement dataset is generated based on the publicly available DL3DV dataset. We will provide the complete dataset generation code to ensure the community can quickly reproduce our results and explore new research opportunities.
W1. Related work appears to be thin and some of these paragraphs can be more descriptive.
Thank you for the constructive feedback. We will refine our Related Work section in the camera ready version. Specifically, NVS methods dealing with sparse-view inputs will be categorized under "Few-shot novel view synthesis", while more general enhancement efforts, such as super-resolution and artifact removal, will be categorized under "Radiance fields enhancement". More literature under these two categories will be added. These revisions would make the Related Work more descriptive and organized.
W2. Method relies on adjacent views for continuous interpolation.
Our method performs well when there is significant overlap between two views. Since we use images rendered from the low-quality 3D Gaussian model as conditions for the video diffusion model, even if two views are not close to each other, our method can still achieve decent results as long as these conditional images contain some structure or texture information. In addition, our method can generate free-trajectory conditional images for continuous interpolation (see the sketch below), offering more flexibility beyond linear interpolation between adjacent views.
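To make the conditional-image generation concrete, here is a minimal sketch, assuming a standard camera-pose interpolation between two reference views; `render_lq_gaussians` is a hypothetical renderer standing in for the low-quality 3DGS model, not the actual implementation.

```python
# Hedged sketch: interpolate camera poses between two reference views and render
# low-quality 3DGS frames at those poses to serve as diffusion conditions.
# `render_lq_gaussians` is a hypothetical placeholder, not the authors' API.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_poses(R0, t0, R1, t1, num_frames=25):
    """Return num_frames (R, t) camera poses between two reference views."""
    slerp = Slerp([0.0, 1.0], Rotation.from_matrix(np.stack([R0, R1])))
    alphas = np.linspace(0.0, 1.0, num_frames)
    rotations = slerp(alphas).as_matrix()                              # spherical interpolation of rotations
    translations = (1 - alphas)[:, None] * t0 + alphas[:, None] * t1   # linear interpolation of camera centers
    return list(zip(rotations, translations))

# Usage (hypothetical): render the conditional frames along the interpolated trajectory.
# poses = interpolate_poses(R_ref0, t_ref0, R_ref1, t_ref1)
# cond_frames = [render_lq_gaussians(R, t) for R, t in poses]
```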
W3. Choice for STD should be quantitatively ablated. Individual parts in STD like color correction should be ablated as well.
Thank you for the valuable suggestion. We conducted an ablation study on two STD components, the color correction and the temporal layers, on the DL3DV dataset (9 views); results are detailed below. Both STD components contribute to the performance improvement, with the temporal layers contributing more, as they ensure the STD produces consistent image outputs that improve the final reconstruction performance. We will include the ablation study on STD components in the supplementary material.
| Video diffusion | STD (temporal layers) | STD (color correction) | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|
| ✓ | | | 18.11 | 0.591 | 0.312 |
| ✓ | ✓ | | 18.44 | 0.625 | 0.306 |
| ✓ | ✓ | ✓ | 18.50 | 0.630 | 0.305 |
Q1. Authors describe in L184-186 "Despite this significant enhancement in the quality of the rendered views, we propose to rely more on the reference views rather than the restored novel views when fine-tuning the 3DGS model, since the 3DGS model is highly sensitive to slight inaccuracies in the restored views." I would appreciate a more thorough discussion on this.
In the 3DGS training process, we observed that the 3DGS model can fit the training images very well, even if the training images contain noise. Unfortunately, images generated by the diffusion model inevitably introduce noise and inconsistencies, resulting in a performance decrease.
To mitigate the negative impact of these generated images on the 3DGS model, a straightforward approach is to assign larger weights to real images and to generated images of higher quality. From our observations, generated images that are closer to the reference views tend to have higher quality. Motivated by this, we introduce an image-level weighting scheme. Additionally, we observed that smaller 3D Gaussian volumes are more likely to be well reconstructed. To prevent well-reconstructed regions from being adversely affected by the generated images, we introduce pixel-level confidence scores calculated from the volumes of the 3D Gaussians. Table 2 of the paper shows the improvement after incorporating image-level and pixel-level confidence. Thank you for the insightful feedback; we will add more discussion in the camera-ready version.
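As an illustration of this weighting scheme, here is a minimal sketch assuming a simple exponential image-level weight and a volume-based pixel-level confidence map; the function forms and names (`view_distance`, `pixel_confidence`) are illustrative assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of a confidence-weighted photometric loss for 3DGS fine-tuning.
# `view_distance` and `pixel_confidence` are illustrative placeholders.
import torch

def confidence_weighted_l1(rendered, target, view_distance, pixel_confidence, sigma=1.0):
    """rendered/target: (3, H, W); pixel_confidence: (1, H, W), higher = better reconstructed."""
    # Image-level weight: generated views closer to a reference view are trusted more.
    image_weight = torch.exp(-torch.as_tensor(view_distance) / sigma)
    # Pixel-level weight: down-weight pixels whose Gaussians are already well reconstructed,
    # so generated content does not corrupt them.
    pixel_weight = 1.0 - pixel_confidence
    return image_weight * (pixel_weight * (rendered - target).abs()).mean()
```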
Q2. "modified baselines denoted with ∗ taken from [36]". However, no baseline seems to be marked with * in table 1. This is possibly an oversight and the authors should correct it.
Thank you for the helpful comment. This is an oversight, and we will remove it in the camera ready version.
Q3. What's the adversarial loss described in L179 for STD?
The adversarial loss employed in Eq. 6 is the standard adversarial loss, which trains a discriminator $D$ to discriminate between the real image $x$ and the fake (decoded) image $\hat{x}$:
$$\mathcal{L}_{\mathrm{adv}} = \mathbb{E}_{x}\big[\log D(x)\big] + \mathbb{E}_{\hat{x}}\big[\log\big(1 - D(\hat{x})\big)\big].$$
Thank you for the valuable comment; we will add it to the camera-ready version.
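For concreteness, a minimal sketch of this standard adversarial term is given below, assuming a binary-cross-entropy (non-saturating) formulation and a hypothetical `discriminator` module; the actual discriminator architecture and loss weighting are not specified here.

```python
# Hedged sketch of the standard adversarial loss used when fine-tuning a decoder.
# `discriminator` is a hypothetical module returning per-image (or per-patch) logits.
import torch
import torch.nn.functional as F

def adversarial_losses(discriminator, real, fake):
    real_logits = discriminator(real)
    fake_logits = discriminator(fake.detach())
    # Discriminator: push real decoded frames toward 1 and generated frames toward 0.
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    # Generator (decoder): encourage the discriminator to classify fakes as real.
    gen_logits = discriminator(fake)
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss
```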
Q4. On what data is the Temporal Denoising U-Net fine-tuned?
Our Temporal Denoising U-Net is fine-tuned on the 3DGS Enhancement Dataset created in this work, which includes a large number of image pairs generated from corresponding high-quality and low-quality 3DGS models. More details of the dataset can be found in Section 5.1 of the paper.
Suggestion: Figure 2 would be more interpretable if the right part showed the output of the enhanced views (with input from the left part); currently there is no 1:1 parity.
Thank you for the helpful suggestion, we will revise Figure 2 of the paper to ensure its right part shows output images corresponding to the input images.
Thank you authors for your detailed reply and conducting the ablation study. I'm happy with your answers and the suggested changes; I shall be increasing my score.
We are very glad that our response and ablation study have met your expectations. We sincerely appreciate your willingness to increase the score.
Thank you again for your time and effort in reviewing our work and providing the constructive comments. Please feel free to let us know if you have any further questions by August 13 AoE, we are more than happy to address them.
This paper utilizes a video diffusion model to enhance 3DGS rendering results. It proposes the 3DGS-Enhancer pipeline, which reformulates the 3D-consistent image restoration task and leverages the video diffusion model to generate high-quality, 3D-consistent images. Sufficient experiments are provided to demonstrate the soundness of the method.
Strengths
They use a video diffusion prior rather than depth information for this task, which is interesting. They also introduce a spatial-temporal decoder that addresses the temporal consistency problem. For fine-tuning, they add confidence constraints at both the pixel level and the image level to improve the results. Extensive comparison with the state of the art is done, and the results seem robust.
Weaknesses
They can provide more comparisons on ablating the temporal consistency.
Questions
The video diffusion prior may contain motion information from moving objects in the scene. How do we eliminate that effect when enhancing 3DGS?
Limitations
The limitation definitely comes from the pre-trained video diffusion model, namely whether its training data is more like novel view synthesis data (static scenes) or a sequence of video frames (moving objects).
W1. They can provide more comparisons on ablating the temporal consistency.
As shown in Fig. 1 of the paper, the images rendered by a 3DGS model trained on sparse-view inputs often contain significant artifacts or blank areas. We observed that an image diffusion model trained solely on these images is unable to restore high-quality images without additional information, and the synthesized multi-view images are often inconsistent, even for adjacent views. In addition, for the temporal layers in the STD, we conducted an ablation study to verify their effectiveness on the DL3DV dataset with 9 views; the result is detailed below. Temporal layers help the STD produce more consistent image outputs, improving the final reconstruction performance. Thank you for the helpful suggestion; we will provide more ablation studies on temporal consistency in the supplementary material.
| Temporal layers in STD | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| ✗ | 18.11 | 0.591 | 0.312 |
| ✓ | 18.50 | 0.630 | 0.305 |
Q1. The video diffusion prior may contain motion information from moving objects in the scene. How do we eliminate that effect when enhancing 3DGS?
We took two steps to mitigate the effect of moving objects in the video diffusion prior. First, we created a paired novel view synthesis dataset from DL3DV to fine-tune the video diffusion model from an image-to-video model into a video-to-video model, ensuring that it produces outputs more akin to novel view synthesis of static scenes. Second, we introduced images rendered from the low-quality 3D Gaussian model as conditional inputs for the video diffusion. Using these images as guidance helps the video diffusion model generate images suitable for reconstructing static 3D scenes.
L1. The limitation definitely comes from the pre-trained video diffusion model, namely whether its training data is more like novel view synthesis data (static scenes) or a sequence of video frames (moving objects).
Your statement of the limitation of this work is correct. Since our approach is based on video diffusion, it is indeed limited by the performance of the pre-trained video diffusion model (e.g., SVD). However, our method will also benefit from advances in video diffusion models, since better pre-trained video diffusion models or larger datasets can enhance the potential of our approach.
Thank you again for your time and effort in reviewing our work and providing the constructive comments. Please feel free to let us know if you have any further questions by August 13 AoE, we are more than happy to address them.
Thank you for addressing my concern.
This paper presents a novel pipeline aimed at enhancing the quality of 3D Gaussian splatting (3DGS) representations, especially in scenarios with sparse input views. The authors propose a novel framework, 3DGS-Enhancer, that leverages video LDMs to generate high-quality and 3D-consistent images. Moreover, to mitigate the artifacts caused by temporal inconsistency, they introduce a spatial-temporal decoder and a fine-tuning strategy for the 3DGS optimization process. Extensive experiments are conducted to demonstrate their superior reconstruction performance on large-scale datasets of unbounded scenes.
Strengths
- Applying video models to scene reconstruction is a novel idea and has achieved excellent results.
- To mitigate the negative impact of inconsistent generated images, this paper proposes improvements to the decoder and defines a confidence measure to enhance the corresponding 3D Gaussian Splatting optimization process.
Weaknesses
- The writing of this paper is not very clear, and some typos exist in the submission; for instance, in line 147 it should be , and in line 216 it should be , consistent with Equation (8).
- Training details are missing in the paper; more details about the video diffusion model and the fine-tuning process should be provided.
- More datasets, such as Mip-NeRF 360, should be evaluated to verify the effectiveness and robustness of the proposed method.
Questions
None
Limitations
None
1. The writing of this paper is not very clear, and some typos exist in the submission; for instance, in line 147 it should be , and in line 216 it should be , consistent with Equation (8).
Thank you for the helpful comment. You are correct: line 147 should be . In line 216, should be . We will carefully revise the writing and fix the typos in the camera-ready version.
2. Training details are missing in the paper, more details about the video diffusion model and fine-tuning process should be provided.
Our video diffusion model includes a pre-trained VAE that encodes an image sequence into a latent sequence and decodes the latent sequence back into the image sequence. It also includes a U-Net with learnable temporal layers, which employs cross-frame attention modules and 3D CNN modules to ensure frame-consistent outputs. The input to the video diffusion model is an image sequence segment containing 25 images sampled with different strides from the image sequences rendered by the low-quality 3DGS model; the first and last frames in this segment are replaced with images rendered from the high-quality 3DGS model (a construction sketch is given below). During fine-tuning, our video diffusion model is conditioned on these image sequence segments and trained to synthesize the corresponding segments rendered from the high-quality 3DGS model.
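The segment construction can be summarized with the following sketch; the stride choices, the random sampling, and the `lq_frames`/`hq_frames` names are illustrative assumptions, not the exact implementation.

```python
# Hedged sketch: build one 25-frame segment from aligned low-/high-quality renders.
import random
import torch

def build_segment(lq_frames, hq_frames, length=25):
    """lq_frames/hq_frames: lists of aligned (3, H, W) renders along one trajectory."""
    stride = random.choice([1, 2, 4])                       # vary the temporal spacing (illustrative values)
    start = random.randint(0, len(lq_frames) - 1 - stride * (length - 1))
    idx = [start + i * stride for i in range(length)]
    cond = torch.stack([lq_frames[i] for i in idx])         # low-quality renders used as conditions
    target = torch.stack([hq_frames[i] for i in idx])       # high-quality renders used as targets
    cond[0] = hq_frames[idx[0]]                             # anchor the first frame with a reference view
    cond[-1] = hq_frames[idx[-1]]                           # anchor the last frame with a reference view
    return cond, target
```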
Our video diffusion model is fine-tuned with a learning rate of 0.0001, with 500 warm-up steps followed by a total of 80,000 training steps. The batch size is set to 1 on each GPU, where each batch consists of 25 images at 512x512 resolution. The Adam optimizer is employed. Additionally, a dropout rate of 0.1 is applied to the conditions between the first and last frames, and the training process uses classifier-free guidance (CFG). The training is conducted on 2 NVIDIA A100-80G GPUs over 3 days. The STD is fine-tuned with a learning rate of 0.0005 for 50,000 training steps. The batch size is set to 1 on each GPU, where each batch consists of 5 images at 512x512 resolution. The fine-tuning is conducted on 2 NVIDIA A100-80G GPUs over 2 days. Thank you for the valuable feedback; more details of the models and the fine-tuning process will be included in the supplementary material (the key settings are also summarized in the configuration sketch below).
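For readability, the reported settings can be summarized as the following configuration sketch (values are those stated above; the key names are illustrative and do not correspond to an actual config file).

```python
# Summary of the reported fine-tuning settings (key names are illustrative).
VIDEO_DIFFUSION_FINETUNE = dict(
    learning_rate=1e-4, warmup_steps=500, total_steps=80_000, optimizer="Adam",
    batch_size_per_gpu=1, frames_per_batch=25, resolution=(512, 512),
    condition_dropout=0.1,      # enables classifier-free guidance at inference
    gpus="2x NVIDIA A100-80G", wall_time="~3 days",
)
STD_FINETUNE = dict(
    learning_rate=5e-4, total_steps=50_000,
    batch_size_per_gpu=1, frames_per_batch=5, resolution=(512, 512),
    gpus="2x NVIDIA A100-80G", wall_time="~2 days",
)
```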
3. More datasets, such as Mip-NeRF 360, should be evaluated to verify the effectiveness and robustness of the proposed method.
Thank you for the constructive feedback. We conducted experiments on the Mip-NeRF 360 dataset, and the results are detailed below and in the attached PDF file in the Author Rebuttal section. Our model is trained on the DL3DV dataset and evaluated on all 9 scenes of the Mip-NeRF 360 dataset. Our method significantly outperforms the baselines, although it has not been trained on this complex dataset. This indicates that our method has remarkable cross-dataset generalization capability for unbounded environments, thanks to the powerful 2D video diffusion priors.
| Method | PSNR ↑ (6 views) | SSIM ↑ (6 views) | LPIPS ↓ (6 views) | PSNR ↑ (9 views) | SSIM ↑ (9 views) | LPIPS ↓ (9 views) |
|---|---|---|---|---|---|---|
| Mip-NeRF | 13.08 | 0.159 | 0.637 | 13.73 | 0.189 | 0.628 |
| RegNeRF | 12.69 | 0.175 | 0.660 | 13.73 | 0.193 | 0.629 |
| FreeNeRF | 12.56 | 0.182 | 0.646 | 13.20 | 0.198 | 0.635 |
| 3DGS | 11.53 | 0.144 | 0.651 | 12.65 | 0.187 | 0.607 |
| DNGaussian | 11.81 | 0.208 | 0.689 | 12.51 | 0.228 | 0.683 |
| 3DGS-Enhancer (ours) | 13.96 | 0.260 | 0.570 | 16.22 | 0.399 | 0.454 |
Thank you again for your time and effort in reviewing our work and providing the constructive comments. Please feel free to let us know if you have any further questions by August 13 AoE, we are more than happy to address them.
Thanks for your detailed response! You have answered most of my questions.
Based on your rebuttal, I have some minor questions.
- In the appendix, you mention that for the evaluation on the Mip-NeRF 360 dataset, you use the original point cloud as the initialization to fully explore the ability of the baselines. I'm interested in understanding how much your approach benefits from this initialization, given that the original point cloud provides information about unseen views, which is clearly advantageous for your method. It would be better to provide experimental results of this ablation.
- I suggest that you provide the project page or video demos for the final version to present a clear comparison.
I will raise my score.
We are very glad that we have addressed most of your questions. We answer the two minor questions below.
1. Mip-NeRF 360 point cloud initialization.
To ensure a fair comparison with the baseline methods and an evaluation faithful to real-world scenarios (sparse-view inputs cannot generate a point cloud as dense as the original one, and the NeRF-based methods cannot fully exploit the original point cloud, which provides additional information from unseen views), we followed DNGaussian and adopted random point cloud initialization for all methods, including the baselines and ours, on both the Mip-NeRF 360 and DL3DV datasets. All experimental results presented in the initial submission and the rebuttals are based on random point cloud initialization (a minimal sketch of such initialization is given below). Thank you for this detailed comment; we will revise the Appendix accordingly.
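The sketch below illustrates random point-cloud initialization under simple assumptions (points drawn uniformly in a cube around the scene); the bounds and point count are illustrative, not the exact values used.

```python
# Hedged sketch: random point-cloud initialization for 3DGS (in the spirit of DNGaussian).
import numpy as np

def random_point_cloud(num_points=100_000, bound=5.0, seed=0):
    rng = np.random.default_rng(seed)
    xyz = rng.uniform(-bound, bound, size=(num_points, 3))   # points in a cube around the scene
    rgb = rng.uniform(0.0, 1.0, size=(num_points, 3))        # random initial colors
    return xyz, rgb
```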
2. Project page or video demos for the final version.
Thank you for the helpful suggestion, we will update the project page according to the revisions made during and after the rebuttal period, including: 1) video demos on the Mip-NeRF 360 dataset; 2) video demos comparing more baseline methods including DNGaussian, RegNeRF, and FreeNeRF.
Thanks for your response! My concerns have been solved!
This paper works on the problem of novel view synthesis with sparse input views. The authors present 3DGS-Enhancer to enhance the rendering quality and address the 3D view consistency problem using 2D video diffusion priors. The experimental results show that this work achieves state-of-the-art performance in novel view synthesis enhancement.
Strengths
- Writing quality is great, clean, and easy to follow.
- To tackle the sparse view reconstruction problem, the idea of fine-tuning the original 3DGS representation instead of incorporating additional geometric constraints is novel and promising.
- A new dataset is introduced in this work, which is useful for this research direction and the community.
- The idea of leveraging video LDM for sparse view reconstruction in unbounded outdoor scenes is novel and effective.
Weaknesses
Runtime is not discussed in this paper. I’m wondering how long it will take for one scene compared to the other baselines.
Questions
The experiments are conducted almost entirely on the self-introduced dataset. Can 3DGS-Enhancer also surpass the other baselines on more commonly used outdoor unbounded datasets (e.g., the Mip-NeRF 360 dataset)?
Limitations
N/A
W1. Runtime is not discussed in this paper. I’m wondering how long it will take for one scene compared to the other baselines.
As shown in the table below, we estimate the per-scene runtime and rendering FPS of different methods on the DL3DV test set (3 views) with one NVIDIA A100 GPU. Our method takes 24.5 minutes for one scene, including 10.5 minutes for low-quality 3DGS training, 2.0 minutes for SVD inference, and 12.0 minutes for high-quality 3DGS training. Thank you for the helpful suggestion; we will include the efficiency analysis in the supplementary material.
| Method | Per-scene runtime ↓ | Rendering FPS ↑ |
|---|---|---|
| Mip-NeRF | 10.7h | 0.09 |
| RegNeRF | 2.5h | 0.09 |
| FreeNeRF | 3.8h | 0.09 |
| 3DGS | 10.5min | 100 |
| DNGaussian | 3.3min | 100 |
| 3DGS-Enhancer (ours) | 24.5min (LQ-3DGS 10.5min, SVD 2.0min, HQ-3DGS 12.0min) | 100 |
Q1. The experiments are conducted almost entirely on the self-introduced dataset. Can 3DGS-Enhancer also surpass the other baselines on more commonly used outdoor unbounded datasets (e.g., the Mip-NeRF 360 dataset)?
Thank you for the constructive feedback. We conducted experiments on the Mip-NeRF 360 dataset, and the results are detailed below and in the attached PDF file in the Author Rebuttal section. Our model is trained on the DL3DV dataset and evaluated on all 9 scenes of the Mip-NeRF 360 dataset. Our method significantly outperforms the baselines, although it has not been trained on this complex dataset. This indicates that our method has remarkable cross-dataset generalization capability for unbounded environments, thanks to the powerful 2D video diffusion priors.
| Method | PSNR ↑ (6 views) | SSIM ↑ (6 views) | LPIPS ↓ (6 views) | PSNR ↑ (9 views) | SSIM ↑ (9 views) | LPIPS ↓ (9 views) |
|---|---|---|---|---|---|---|
| Mip-NeRF | 13.08 | 0.159 | 0.637 | 13.73 | 0.189 | 0.628 |
| RegNeRF | 12.69 | 0.175 | 0.660 | 13.73 | 0.193 | 0.629 |
| FreeNeRF | 12.56 | 0.182 | 0.646 | 13.20 | 0.198 | 0.635 |
| 3DGS | 11.53 | 0.144 | 0.651 | 12.65 | 0.187 | 0.607 |
| DNGaussian | 11.81 | 0.208 | 0.689 | 12.51 | 0.228 | 0.683 |
| 3DGS-Enhancer (ours) | 13.96 | 0.260 | 0.570 | 16.22 | 0.399 | 0.454 |
Thank you again for your time and effort in reviewing our work and providing the constructive comments. Please feel free to let us know if you have any further questions by August 13 AoE, we are more than happy to address them.
Thanks for addressing my concerns! Please add the above experimental results in your revision!
The authors thank all reviewers for their careful review and constructive feedback. We are encouraged that all four reviewers appreciate the novel idea and excellent experimental results of this work. We address all the raised concerns in the corresponding reviewer's rebuttal section. The code and dataset will be publicly available.
Mip-NeRF 360 Result (Cross-dataset Generalization Experiment):
Thanks to the constructive feedback from Reviewers 19Ng and KbVQ, we conducted a cross-dataset generalization experiment on the Mip-NeRF 360 dataset; the results are shown in the table below and in the attached PDF file. Our video diffusion and STD models are trained on the DL3DV dataset and evaluated on all 9 scenes of the Mip-NeRF 360 dataset. The results show that our method significantly outperforms the baselines, indicating remarkable cross-dataset generalization capability for unbounded environments, thanks to the powerful 2D video diffusion priors.
| Method | PSNR ↑ (6 views) | SSIM ↑ (6 views) | LPIPS ↓ (6 views) | PSNR ↑ (9 views) | SSIM ↑ (9 views) | LPIPS ↓ (9 views) |
|---|---|---|---|---|---|---|
| Mip-NeRF | 13.08 | 0.159 | 0.637 | 13.73 | 0.189 | 0.628 |
| RegNeRF | 12.69 | 0.175 | 0.660 | 13.73 | 0.193 | 0.629 |
| FreeNeRF | 12.56 | 0.182 | 0.646 | 13.20 | 0.198 | 0.635 |
| 3DGS | 11.53 | 0.144 | 0.651 | 12.65 | 0.187 | 0.607 |
| DNGaussian | 11.81 | 0.208 | 0.689 | 12.51 | 0.228 | 0.683 |
| 3DGS-Enhancer (ours) | 13.96 | 0.260 | 0.570 | 16.22 | 0.399 | 0.454 |
All the reviewers agree to accept this paper. It presents a novel method to enhance the quality of 3DGS representations by leveraging 2D video diffusion priors. Additionally, the paper introduces a spatial-temporal decoder that effectively addresses the issue of temporal consistency. The experimental results, both qualitative and quantitative, show significant improvements compared to the baselines. All the reviewers' concerns have been addressed.