From Chaos to Clarity: 3DGS in the Dark
We enhance 3D Gaussian Splatting for HDR reconstruction from noisy raw images, outperforming prior models on the RawNeRF dataset.
Abstract
Reviews and Discussion
This paper introduces a new framework for 3D reconstruction and denoising from raw images. Specifically, a noise extractor and a noise-robust reconstruction loss are proposed to deal with the overfitting of 3DGS to the noise heavily distributed in raw images. Experiments on the RawNeRF dataset and ablation studies are provided to demonstrate the effectiveness of the proposed methods.
Strengths
- Intuitive noise decomposition and informative noise removal (or estimation) method
- Better reconstruction and denoising performance compared to 3DGS
- Good writing and presentation
Weaknesses
- The innovation may not be enough for this conference. The major contribution of this paper is the denoising part, where the noise estimator needs pre-training for good performance (extra data is required).
- The performance improvement over the baseline with the RawNeRF loss is marginal, and there is even no synthesis-quality improvement compared to the scores reported in the original RawNeRF paper (which is NeRF-based).
Questions
Please refer to the weakness part.
Limitations
The limitations mentioned in the paper are not easy to address, so they did not affect my rating of this paper.
We appreciate your valuable suggestions and comments. Here are our detailed responses to your feedback:
Novelty Limited Due to the Need for Noise-Clean Paired Data:
We want to clarify that the pretrained model aims to provide a good initialization and does not have strict restrictions on the data used for training (e.g., we can use data from different devices or noise-only data). Specifically, our pretrained dataset, SID (captured by a DSLR camera), differs significantly from the RawNeRF dataset (captured by an iPhone X) in terms of color space and bit depth. To further verify the impact, we also experimented with a self-supervised pretraining method like Neighbor2Neighbor with noise-only images in SID. The results, shown in Table R1, indicate no significant difference, supporting that noise-clean paired images are not strictly necessary during the pretraining stage.
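For concreteness, a minimal sketch of such a Neighbor2Neighbor-style self-supervised pretraining step is given below (in PyTorch); the helper names, the 2x2 neighbor sampler, and the regularizer weight `gamma` are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def make_neighbor_indices(B, H, W, device):
    # Pick two different pixels from every 2x2 cell (Neighbor2Neighbor-style sampling).
    idx1 = torch.randint(0, 4, (B, 1, H // 2, W // 2, 1), device=device)
    idx2 = (idx1 + torch.randint(1, 4, idx1.shape, device=device)) % 4
    return idx1, idx2

def subsample(img, idx):
    # Gather one pixel per 2x2 cell according to idx; img is (B, C, H, W) with even H, W.
    B, C, H, W = img.shape
    cells = img.unfold(2, 2, 2).unfold(3, 2, 2).reshape(B, C, H // 2, W // 2, 4)
    return torch.gather(cells, -1, idx.expand(B, C, H // 2, W // 2, 1)).squeeze(-1)

def n2n_pretrain_step(denoiser, noisy, gamma=1.0):
    # One self-supervised step using only noisy raw patches (no clean targets).
    B, _, H, W = noisy.shape
    idx1, idx2 = make_neighbor_indices(B, H, W, noisy.device)
    sub1, sub2 = subsample(noisy, idx1), subsample(noisy, idx2)
    pred = denoiser(sub1)                       # denoise one sub-image
    rec = F.mse_loss(pred, sub2)                # its spatial neighbor acts as the noisy target
    with torch.no_grad():                       # stop-gradient regularizer from the full image
        full = denoiser(noisy)
        t1, t2 = subsample(full, idx1), subsample(full, idx2)
    reg = F.mse_loss(pred - sub2, t1 - t2)      # corrects the sub-sampling gap
    return rec + gamma * reg
```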
Comparison with Baselines and RawNeRF:
We want to clarify that in the few-shot settings (which are more common in real-world scenarios like 3D reconstruction with multi-camera setups or for autonomous driving) on the RawNeRF dataset, we achieved a 3-4 dB gain compared to the RawNeRF-loss baselines, as presented in Figures 5 and 6. In full-view settings, we also observed around a 0.9 dB gain compared to the 3DGS with RawNeRF loss baseline. While our method's reconstruction quality is similar to RawNeRF, we significantly reduced the training time by 100x and boosted the rendering speed by 5000x, as shown in Table R2, making it more feasible for real-world applications.
Once again, we sincerely thank you for your constructive feedback, which has been crucial in refining our work.
Tables:
Table R1: Ablation Study of F Pretraining on RawNeRF Datasets (Full Training Views)
| Method | F Pretrained Dataset | F Pretrained Method | Raw PSNR | RGB PSNR | RGB SSIM | RGB LPIPS |
|---|---|---|---|---|---|---|
| Ours | SID (noise) | self-supervised (Neighbor2Neighbor) | 59.32 | 23.45 | 0.530 | 0.505 |
| Ours | SID (noise-clean) | supervised | 59.49 | 23.53 | 0.535 | 0.499 |
Table R2: Quantitative Comparison on RawNeRF Datasets (Full Training Views)
| Method | Training Time | FPS | Raw PSNR | RGB PSNR | RGB SSIM | RGB LPIPS |
|---|---|---|---|---|---|---|
| RawNeRF | 140h | 0.01 | 59.07 | 23.53 | 0.538 | 0.500 |
| HDR Scaffold-GS | 1.6h | 73 | 58.08 | 22.69 | 0.521 | 0.513 |
| Ours | 3.1h | 80 | 59.49 | 23.53 | 0.535 | 0.499 |
Thanks for the authors' reply. I still have some concerns regarding the performance of this paper:
- In the setting of full training views, without using extra clean data, your model cannot outperform RawNeRF in terms of post-processed RGB metrics.
- What is the performance of RawNeRF in the few-shot setting? Considering the major contribution of this paper is a new training paradigm without modifying the original novel view synthesis model, I think it would be better to apply this method to NeRF and compare it with RawNeRF to fully demonstrate its effectiveness in the few-shot setting.
- The improved training and inference speeds mainly originate from 3DGS itself, not from the design in this paper.
However, given the effectiveness of this method on 3DGS and the excellent writing of this paper, I am inclined to give borderline acceptance.
Dear Reviewer h6sN,
Thank you sincerely for your reply and suggestions!
Few-shot Setting of RawNeRF
Training RawNeRF takes about one week per scene, so we currently do not have enough time to evaluate its performance. However, given that our method shows more significant improvements in the few-shot setting, we anticipate it may also perform better in this context. We will include this comparison in the final version.
Improved Speed
We acknowledge that the improved speed primarily comes from 3DGS. However, when compared to both 3DGS and 3DGS+RawNeRF loss, our method achieves enhanced rendering speed (approximately 4 times faster with a limited 4-view setting).
Intrinsic Performance Gap Between 3DGS and NeRF
3DGS and NeRF represent different approaches to 3D representation. Prior to Scaffold-GS, 3DGS methods had lower reconstruction quality compared to NeRF, despite their faster rendering speeds. While 3DGS offered faster inference than RawNeRF, it struggled with noise overfitting, leading to a significant gap in RGB reconstruction quality (22.69 vs. 23.53 in PSNR).
Our key contribution addresses this gap: we analyzed why 3DGS overfits to noise and proposed a self-supervised solution that narrows this gap, achieving parity with RawNeRF in RGB reconstruction quality (23.53 vs. 23.53 in PSNR). Additionally, our method further improves the already fast rendering speed of 3DGS.
We hope this clarifies the unique contributions and impact of our work.
Best regards,
Authors
The paper tries to use raw images with high dynamic range (HDR) for training 3D Gaussian Splatting (3DGS). It first analyzes how noise from raw images affects the optimization of 3DGS, especially when the number of training views is small. To address this issue, it first uses a lens distortion correction network before training to correct the distortions of the raw images. It then introduces a noise extractor to predict noise from the raw image and presents a novel noise-robust reconstruction loss, which consists of a RawNeRF loss term, an NLL term measuring the divergence between the estimated noise and the physical noise, and a term for decorrelating noise across pixels. The method outperforms baselines on the RawNeRF dataset in rendering quality and inference speed.
Strengths
- The paper mainly tries to use a prior noise model to eliminate the effects of noise in raw images, pushing a step forward to utilizing HDR raw images in 3DGS.
- The paper includes extensive comparison with baselines including LDR Scaffold-GS, HDR Scaffold-GS, and two-stage methods with various denoisers as the first stage. And it achieves supreme results compared to these baselines.
- The paper analyzes and demonstrates the effect of noises from raw images on 3dgs optimization and its relationship with the number of training views. It does experiments on both full views and limited views. The whole structure is well-organized and complete.
Weaknesses
- The method assumes clean pixel values of a 3d point projected on different images remain the same. But these images are taken from different viewpoints. How does it handle the effects of viewing directions for non-Lambertian surfaces?
- The loss consists of two terms. I am especially curious about the effect of the decorrelation term. How much would it help the method?
Questions
Please see the weaknesses.
Limitations
- The calibration process for the camera noise model parameters seems to be pretty time-consuming and requires capturing a lot of images, e.g., in L413 "capture 100 dark frames at each ISO in a dark room". I wonder whether this method can be applied to images captured in a more home-like setting, or whether it always requires a dataset prepared in a professional setting.
We appreciate your valuable suggestions and comments. Here are our detailed responses to your feedback:
Lambertian Surface Assumption:
We use the Lambertian surface assumption to simplify our analysis. However, our method can also be applied to non-Lambertian surfaces, such as mirror reflections on a piano. The core of our method is the integration of a noise model to combine the noise prior with 3DGS reconstruction. Since 3DGS supports non-Lambertian surfaces, our method is also applicable in such scenarios.
Effect of the Decorrelation Term:
The decorrelation term ensures that noise remains independent across pixels and maintains the standard deviation of the noise distribution within a pixel. We want to clarify that its ablation is already presented in Table 4 of our paper. We also provide a visual illustration in Figure R3 of our submitted PDF. As shown, the term ensures that the noise remains independent across different pixels and adjusts the extracted noise distribution within a pixel to follow a standard Gaussian distribution after normalization. Since the other loss terms only ensure that the mean of the extracted noise is zero, they do not control the standard deviation. Therefore, constraining the diagonal of the covariance matrix in this term is necessary to ensure the standard deviation is one.
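As a rough illustration (not the exact loss in the paper), a decorrelation-style penalty on the normalized extracted noise could be sketched as follows; the patch size and function name are assumptions for the example.

```python
import torch

def decorrelation_loss(noise_norm, patch=4):
    # noise_norm: (B, 1, H, W) extracted noise divided by the std predicted by the
    # physical noise model, so each pixel should follow a standard Gaussian.
    # Each patch x patch block is treated as one sample of a patch**2-dim vector,
    # and its sample covariance is pushed toward identity:
    # off-diagonal entries -> independence across pixels, diagonal entries -> unit variance.
    B, C, H, W = noise_norm.shape
    blocks = noise_norm.unfold(2, patch, patch).unfold(3, patch, patch)
    vecs = blocks.reshape(-1, patch * patch)            # (num_blocks, patch**2)
    vecs = vecs - vecs.mean(dim=0, keepdim=True)        # center before covariance
    cov = vecs.t() @ vecs / max(vecs.shape[0] - 1, 1)   # sample covariance matrix
    eye = torch.eye(patch * patch, device=noise_norm.device)
    return ((cov - eye) ** 2).mean()
```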
Home-like Setting:
We have found that the noise model can also be calibrated in a home-like setting as follows:
a. Flat Frames Capture: Use a MacBook screen to capture flat frames. Adjust the focus of the mobile phone (iPhone X) camera to infinity, attach it tightly to the screen, and adjust the screen brightness to capture a series of flat frames at different ISO and exposure levels. This process is illustrated in Figure R4(a) of the submitted PDF.
b. Dark Frames Capture: Use a drawer and a headphone to capture dark frames. Connect the mobile phone (iPhone X) with a (wired/wireless) headphone, open the camera app, place the phone in the drawer, close the drawer, and use the headphone's play button to remotely take photos. Capture 50 dark frames at different ISOs. This setup is shown in Figure R4(b) of the submitted PDF.
These settings can be easily achieved in a home environment. We found that the noise parameters calibrated this way are very close to those calibrated in a lab setting (dark room) (e.g., 7.38 vs. 7.06). The impact on the final 3DGS reconstruction performance on RawNeRF is less than 0.2 dB. Additionally, we have developed a semi-automatic labeling app for Android mobile phones, as shown in Figure R4(c) of the submitted PDF, which we will release upon the paper's acceptance. We hope this tool will enable more users to enjoy 3DGS reconstruction in low-light conditions using their own mobile phones.
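To illustrate the kind of fitting involved, here is a minimal sketch of calibrating a simple shot-plus-read noise model, var(x) = k * mean(x) + sigma_r^2, from flat and dark frames; the black level value and variable names are placeholders, and the full calibration used in our paper includes additional noise components.

```python
import numpy as np

def calibrate_noise_model(flat_frame_stacks, dark_frames, black_level=528.0):
    # flat_frame_stacks: list of (N, H, W) arrays, each a burst of flat frames captured
    #   at one brightness level (fixed ISO and exposure).
    # dark_frames: (M, H, W) array of dark frames at the same ISO.
    # Returns (k, sigma_read) in raw digital numbers after black-level subtraction.
    means, variances = [], []
    for stack in flat_frame_stacks:
        stack = stack.astype(np.float64) - black_level
        means.append(stack.mean(axis=0).reshape(-1))        # per-pixel temporal mean
        variances.append(stack.var(axis=0, ddof=1).reshape(-1))
    mean_all = np.concatenate(means)
    var_all = np.concatenate(variances)
    k, _ = np.polyfit(mean_all, var_all, deg=1)             # slope of the mean-variance line = gain k
    dark = dark_frames.astype(np.float64) - black_level
    sigma_read = dark.std(axis=0, ddof=1).mean()            # read noise from dark frames
    return k, sigma_read
```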
Once again, we sincerely thank you for your constructive feedback, which has been crucial in refining our work.
Thank you for answering my questions. My questions are solved. So I'll keep my positive rating.
Dear reviewer,
The authors have submitted a rebuttal to your comments. Could you please read it and respond? The deadline for your reply is within one day. Thank you.
AC
This paper proposes a novel self-supervised learning framework to reconstruct HDR 3D Gaussian Splatting (3DGS) from noisy raw images. This addresses the issue of noise degrading reconstruction quality and inference speed in 3DGS, especially in scenarios with limited views. The proposed method demonstrates superior performance compared to LDR/HDR 3DGS and previous state-of-the-art models in both reconstruction quality and inference speed on the RawNeRF dataset.
Strengths
- The paper is well written, the methodology is well motivated and technically sound.
- The overall problem is relevant, and the introduction of a noise-free loss function for 3DGS is a plausible and valid contribution. The approach of trying to model the noise for a noise-robust representation is an original idea.
Weaknesses
- The experimental evaluation misses a comparison to NeRF-based methods on RAW data, such as RawNeRF. While the NeRF-based methods have their downsides, they are still good baselines to judge the actual denoising capabilities of 3DGS-based methods.
- The impact of the noise extractor F is still unclear to me. The authors mention that it needs to be pre-trained with a paired dataset of noisy and noise-free images. It would be good to investigate this component more, e.g., by providing an ablation of this part.
Questions
Figure 5 is hard to understand because of the loss-function symbols. It remains unclear to me what exactly is shown there. Maybe it would be good to show only the most relevant baselines in these figures.
Limitations
The authors provide a section discussing the limitations of their method.
We sincerely thank you for your thorough review and valuable suggestions. Here are our detailed responses to your comments:
Comparison with NeRF Baselines:
We acknowledge the importance of comparing our method with NeRF-based methods. We have conducted a comparison between the NeRF and HDR Scaffold-GS baselines and our proposed method in Table R1. We observed that the unmodified 3DGS (HDR Scaffold-GS) exhibited a gap in rendering quality compared to the NeRF-based method (RawNeRF), as 3DGS is more prone to noise. This is discussed in our manuscript and illustrated in Figure R2 of our submitted PDF. Our method bridges this gap, achieving performance similar to RawNeRF while significantly reducing training time (by a factor of 100) and increasing rendering speed (by a factor of 5000), making raw 3D reconstruction feasible in real-world applications.
Ablation of the Noise Extractor F:
We want to clarify that pretraining the noise extractor F is necessary because the U-Net struggles to converge and extract high-frequency noise within 30,000 iterations during 3DGS training. Pretraining F ensures it can effectively extract high-frequency noise. Our pretraining dataset, SID (captured by a DSLR camera), differs significantly from the RawNeRF dataset (captured by an iPhone X) in terms of color space and bit depth. To verify the impact, we also experimented with a self-supervised method like Neighbor2Neighbor (using one output as the clean image) with noise-only images in SID. The results, shown in Table R2, indicate no significant difference, supporting that noise-clean paired images are not strictly necessary during the pretraining stage.
Clarification of Figure 5:
We apologize for the unclear presentation in Figure 5. The solid line in Figure 5 represents the one-stage method, corresponding to the last three rows in Table 1. The dashed line denotes the two-stage method, corresponding to the rest of Table 1. The one-stage method directly trains the 3DGS on the noisy images with different loss functions. In the two-stage method, the input is first denoised using a pretrained denoiser, and then the 3DGS is trained with RawNeRF loss functions. We will clarify this in the revised version of our manuscript for better clarity.
Once again, we sincerely thank you for your constructive feedback, which has been crucial in refining our work.
Tables:
Table R1: Quantitative Comparison on RawNeRF Datasets (Full Training Views)
| Method | Training Time | FPS | Raw PSNR | RGB PSNR | RGB SSIM | RGB LPIPS |
|---|---|---|---|---|---|---|
| RawNeRF | 140h | 0.01 | 59.07 | 23.53 | 0.538 | 0.500 |
| HDR Scaffold-GS | 1.6h | 73 | 58.08 | 22.69 | 0.521 | 0.513 |
| Ours | 3.1h | 80 | 59.49 | 23.53 | 0.535 | 0.499 |
Table R2: Ablation Study of F Pretraining on RawNeRF Datasets (Full Training Views)
| Method | F Pretrained Dataset | F Pretrained Method | Raw PSNR | RGB PSNR | RGB SSIM | RGB LPIPS |
|---|---|---|---|---|---|---|
| Ours | SID (noise) | self-supervised (Neighbor2Neighbor) | 59.32 | 23.45 | 0.530 | 0.505 |
| Ours | SID (noise-clean) | supervised | 59.49 | 23.53 | 0.535 | 0.499 |
Dear authors,
Thank you, I appreciate your efforts in answering my concerns. Table R1 indeed shows that the quality is on par with RawNeRF, and the explanation of the noise extractor F makes sense. Thus I increase my rating to a weak accept.
Best,
Reviewer JuHQ
Dear Reviewer JuHQ,
We would like to express our deepest gratitude for your thorough review and invaluable feedback on our paper. We thank you for recognizing the novelty and our comparative experiments with RawNeRF, which is highly encouraging and serves as a great motivation for our continued research efforts. We are also sincerely thankful for your prompt response, which has greatly contributed to the improvement of our paper.
Best regards,
Authors
This paper investigates the issue of 3DGS overfitting noises in input images, and proposes a self-supervised learning framework as the solution. The paper integrates a noise model as prior to relax the constraints in the 3DGS optimization framework.
Strengths
- The paper provides a detailed analysis of how noise impacts the optimization of 3DGS.
- The proposed framework leverages a physical-based noise model to jointly denoise and enhance 3DGS with noisy inputs.
- The paper is well-written.
Weaknesses
- It would be better to introduce comparisons in more sparse-view settings, such as the LLFF dataset.
Questions
See the weaknesses.
Limitations
The authors have discussed the limitations.
We sincerely thank you for your review and valuable suggestions on our paper. In response to your suggestion regarding comparisons on more sparse-view settings, we have conducted additional quantitative comparisons using the LLFF dataset with simulated noisy raw images. Below are the details of our approach and findings:
To address the more sparse-view setting (3 views), we used the inverse ISP proposed in [1] to convert the LLFF dataset RGB images into RAW images. We then added synthetic noise following the iPhone X noise model and adhered to the baseline methods' settings [2] for a fair comparison. Specifically, we used FSGS [2] as the 3D representation baseline and only modified the loss function to our proposed noise-robust reconstruction loss. The results, presented in Table R1, demonstrate that our method consistently outperforms other methods even with limited-view settings.
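A minimal sketch of this simulation pipeline is shown below; it is simplified relative to the inverse ISP of [1], and the white-balance gains and noise parameters are hypothetical placeholders rather than the calibrated iPhone X values.

```python
import numpy as np

def rgb_to_synthetic_raw(rgb, wb_gains=(2.0, 1.0, 1.6), gamma=2.2):
    # Crude inverse ISP: sRGB -> linear -> inverse white balance -> RGGB Bayer mosaic.
    linear = np.clip(rgb, 0.0, 1.0) ** gamma
    linear = linear / np.array(wb_gains)[None, None, :]
    H, W, _ = linear.shape
    raw = np.zeros((H, W), dtype=np.float64)
    raw[0::2, 0::2] = linear[0::2, 0::2, 0]   # R
    raw[0::2, 1::2] = linear[0::2, 1::2, 1]   # G
    raw[1::2, 0::2] = linear[1::2, 0::2, 1]   # G
    raw[1::2, 1::2] = linear[1::2, 1::2, 2]   # B
    return raw

def add_shot_read_noise(raw, k=0.002, sigma_read=0.001, rng=None):
    # Heteroscedastic Gaussian approximation of shot + read noise in linear raw space.
    if rng is None:
        rng = np.random.default_rng(0)
    std = np.sqrt(np.clip(k * raw, 0.0, None) + sigma_read ** 2)
    return np.clip(raw + rng.normal(0.0, 1.0, raw.shape) * std, 0.0, 1.0)
```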
Once again, we appreciate your constructive feedback and hope that our response addresses your concerns.
Tables and References:
Table R1: Quantitative Comparison on LLFF Datasets (3 Training Views)
| Method | Raw PSNR (1/4)* | RGB PSNR (1/4)* | FPS (1/4)* | Raw PSNR (1/8)* | RGB PSNR (1/8)* | FPS (1/8)* |
|---|---|---|---|---|---|---|
| BM3D | 42.902 | 18.032 | 163 | 22.596 | 17.242 | 272 |
| PMN | 42.585 | 18.064 | 201 | 23.208 | 18.572 | 389 |
| Neighbor2Neighbor | 42.774 | 17.977 | 152 | 23.008 | 18.297 | 372 |
| FSGS | 41.197 | 15.496 | 65 | 21.183 | 15.166 | 132 |
| Ours | 43.418 | 18.753 | 216 | 23.799 | 18.902 | 370 |
Here * denotes the downsampling ratio of the resolution, following the same settings as in [2].
References:
[1] Reverse Imaging Pipeline for Raw RGB Image Augmentation; Samu Koskinen, Dan Yang, and Joni-Kristian Kämäräinen, ICIP 2019
[2] FSGS: Real-Time Few-Shot View Synthesis using Gaussian Splatting; Zehao Zhu, Zhiwen Fan, Yifan Jiang, and Zhangyang Wang, ECCV 2024
This paper presents a self-supervised learning framework for reconstructing HDR 3DGS from noisy raw images. The proposed method integrates a noise extractor and a noise-robust reconstruction loss to mitigate the effects of noise in raw images. By leveraging a noise distribution prior, the framework improves both the reconstruction quality and inference speed of 3DGS, particularly in scenarios with limited training views. Experimental results on the RawNeRF dataset show that the proposed approach significantly outperforms existing state-of-the-art methods in terms of rendering quality and speed, providing a robust solution for HDR 3DGS reconstruction in challenging lighting conditions.
Strengths
- Proposes both denoising first and denoising on the fly, then using a 3DGS-based method to reconstruct the scene. It performs better than the other denoising methods.
- Gains better performance with fewer views.
Weaknesses
- It seems that there are only two scenes for qualitative comparison. To the best of my knowledge, there are only three scenes for testing in the RawNeRF dataset, with each scene containing only one view. This suggests a potential lack of evaluation. The author should provide more results, at least for qualitative comparison.
- The proposed analysis seems to be not very relevant to the main text, and I did not find its connection with the proposed noise-robust reconstruction loss. Why is a self-supervised denoiser needed if N (in Eq. 9) is small? And if N is large, is a self-supervised denoiser still necessary? Additionally, the proposed analysis is very similar to the motivation of RawNeRF and burst denoising, which use more (large N in Eq. 9) unstabilized noisy views for denoising. Therefore, I do not see the novelty in the first contribution.
- The authors seem not to have retrained the SOTA denoisers (e.g., ELD, PMN) with the calibrated noise model of the iPhone X. This presents an unfair comparison. I think the authors should provide results with these denoisers retrained using the iPhone X noise model, given that the proposed method also requires noise calibration. If the results are comparable to those of the proposed method, then why is the proposed method needed? Wouldn't it be sufficient to simply perform denoising first and then reconstruction?
- The authors did not provide results compared with the original 3DGS. Why is that? It is not a big deal, but I think a comparison with RawNeRF should be provided (at least in Table 1, for the results with full training views).
- There might be a typo: if the authors directly use the pretrained checkpoints (as mentioned in lines 224-225), why is a loss function listed for ELD, PMN, etc. in Table 1?
Questions
See the weakness.
Limitations
Yes
We deeply appreciate your thorough review and valuable feedback on our submission. Here are our detailed responses:
More Quantitative and Qualitative Results:
Given the limitations of the RawNeRF dataset, we conducted additional quantitative comparisons using the LLFF dataset with simulated noisy raw images and provided qualitative comparisons for the training scenes in the RawNeRF dataset. Specifically, we used the inverse ISP proposed in [1] to convert LLFF RGB images into RAW images, added synthetic noise following the iPhone X noise model, and adhered to the baseline methods' settings [2]. We used FSGS [2] as the 3D representation baseline, modifying only the loss function to our proposed noise-robust reconstruction loss. The results are listed in Table R1. Our method consistently outperforms others even with limited-view settings and diverse training scenes. For qualitative results, please refer to Fig. R1 in our submitted PDF, which shows that our method produces smoother results and reduces artifacts.
Connection Between the Analysis and Main Text:
Our analysis highlights an overlooked issue: 3DGS is more vulnerable to noise in few-shot settings compared to neural radiance field-based 3D reconstruction. Unlike RawNeRF, 3DGS requires explicit noise regularization. Specifically, in Sec. 4.1, we illustrate the connection between N and the bias between the 3DGS optimal target and the real-world point color. As N decreases, the bias increases, necessitating a denoiser to prevent 3DGS points from overfitting to noisy training views, as shown in Fig. 6 of our manuscript. This issue is common in scenarios like multi-camera 3D reconstruction and street scene reconstruction in autonomous driving. Although the effect diminishes with larger N, it remains significant, as evidenced by our results with 100-view settings (Fig. 7 and Table 1).
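As a simplified restatement of this relationship (assuming a single Gaussian fully explains a pixel seen in N views with i.i.d. zero-mean noise of variance sigma^2; the analysis in Sec. 4.1 is more general):

```latex
c^\star = \arg\min_{c}\sum_{i=1}^{N}\bigl(c-(c_{\mathrm{gt}}+n_i)\bigr)^2
        = c_{\mathrm{gt}} + \frac{1}{N}\sum_{i=1}^{N} n_i,
\qquad
\mathbb{E}\bigl[(c^\star - c_{\mathrm{gt}})^2\bigr] = \frac{\sigma^2}{N}.
```

The residual noise absorbed by the representation therefore scales as sigma^2/N, which is why an explicit noise prior matters most in the few-shot (small N) setting.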
Novelty of First Contribution:
The impact of noise is unique in 3DGS compared to NeRF-based methods. MLPs in NeRF can perform low-frequency filtering, making it challenging to overfit to high-frequency noise, as explored in [3]. RawNeRF requires ~5 days to train a single scene and 2 minutes to render a new view at 4K resolution, whereas 3DGS can train in 2 hours and render at 50 FPS. However, 3DGS tends to generate thin, flat Gaussians that fit the noise, degrading rendering speed and quality. Visual details comparing NeRF and 3DGS can be found in Fig. R2 of our submitted PDF. We revise our first contribution in the introduction as follows:
We explore the unique impact of noise on 3DGS and its relationship with the number of input views. We highlight that 3DGS is more vulnerable to noise than neural radiance field-based 3D reconstruction, especially in few-shot settings, and provide a detailed analysis of how noise impacts the optimization of 3DGS, modeling its relationship with the number of training views and the noise distribution.
Training ELD/PMN with iPhone X Noise Model:
There is a gap between simulated iPhone X noise on DSLR images and real noisy iPhone X images due to differences in bit depth (14-bit vs. 12-bit). Following your suggestion, we simulated noise-clean paired data using the iPhone X noise model and performed denoising first, followed by reconstruction. The results, listed in Table R2, do not show significant performance improvement, likely due to sensor differences between the iPhone X and DSLR cameras.
Comparison with 3DGS and RawNeRF:
We agree that a comparison with 3DGS and RawNeRF is valuable. We have added this comparison in Table R2 under full-view settings. Our method shows that the original 3DGS is more prone to noise, resulting in lower reconstruction quality. Although NeRF-based methods achieve similar results due to their low-pass characteristics, they require significantly more time (5000x for rendering a single image and 100x for training) compared to our method, making them impractical for real-world applications.
Pretrained Checkpoint and Loss Function:
The denoising method in Table 1 is used to first denoise the input, and then the loss function is applied to 3DGS for the 3D reconstruction loss. Thus, the loss function is only used during the 3D reconstruction training period.
Once again, we sincerely thank you for your constructive feedback.
Tables and References:
Table R1: Comparison on LLFF Datasets (3 Training Views)
| Method | Raw PSNR (1/4)* | RGB PSNR (1/4)* | FPS (1/4)* | Raw PSNR (1/8)* | RGB PSNR (1/8)* | FPS (1/8)* |
|---|---|---|---|---|---|---|
| BM3D | 42.902 | 18.032 | 163 | 22.596 | 17.242 | 272 |
| PMN | 42.585 | 18.064 | 201 | 23.208 | 18.572 | 389 |
| Neighbor2Neighbor | 42.774 | 17.977 | 152 | 23.008 | 18.297 | 372 |
| FSGS | 41.197 | 15.496 | 65 | 21.183 | 15.166 | 132 |
| Ours | 43.418 | 18.753 | 216 | 23.799 | 18.902 | 370 |
Here * denotes the downsampling ratio of the resolution, the same as in [2].
Table R2: Comparison on RawNeRF Datasets (Full Training Views)
| Method | Training Time | FPS | Raw PSNR | RGB PSNR | RGB SSIM | RGB LPIPS |
|---|---|---|---|---|---|---|
| ELD | 1.5h | 80 | 54.70 | 19.82 | 0.511 | 0.544 |
| PMN | 1.4h | 94 | 53.69 | 19.00 | 0.498 | 0.584 |
| ELD (fine-tuned) | 17.7h | 79 | 55.01 | 19.96 | 0.514 | 0.531 |
| PMN (fine-tuned) | 18.6h | 92 | 53.97 | 19.22 | 0.503 | 0.575 |
| HDR 3DGS | 1.1h | 75 | 57.39 | 20.83 | 0.518 | 0.569 |
| RawNeRF | 140h | 0.01 | 59.07 | 23.53 | 0.538 | 0.500 |
| HDR Scaffold-GS | 1.6h | 73 | 58.08 | 22.69 | 0.521 | 0.513 |
| Ours | 3.1h | 80 | 59.49 | 23.53 | 0.535 | 0.499 |
References:
[1] Reverse Imaging Pipeline for Raw RGB Image Augmentation
[2] FSGS: Real-Time Few-Shot View Synthesis using Gaussian Splatting
[3] DINER: Disorder-Invariant Implicit Neural Representation
Dear Reviewer m8Xp,
Thank you very much for your insightful review and the valuable suggestions you provided. We have carefully considered your feedback and have addressed your comments thoroughly in our rebuttal. We believe that the clarifications and improvements we’ve made should effectively address the concerns you raised.
If you find that our responses have satisfactorily resolved the issues, we kindly ask you to consider adjusting your rating accordingly.
We are more than happy to provide any additional information or discuss further if needed.
Thank you again for your time and thoughtful consideration.
Best regards,
Thanks for the authors' response and for the highlighted lines. However, there are still some aspects of your reply that I don't fully understand. I hope the authors can provide further clarification.
- As for the newly provided qualitative and quantitative results:
- What led the authors not to provide the qualitative results of RawNeRF in Fig. R1? I did not find the RawNeRF reconstruction results from the corresponding viewpoints in Fig. R1. However, to the best of my knowledge, the reconstruction results of RawNeRF are much better than those of the authors' proposed method (at least in the full-view setting). As the authors mentioned that "We agree that a comparison with 3DGS and RawNeRF is valuable.", I think it is important to provide the results from RawNeRF. At least discuss the reasons that led to the visual quality being inferior to RawNeRF (primarily due to the retention of low-frequency noise, which manifests in the results as a global color cast).
- The qualitative results provided by the authors do not effectively demonstrate the improvements over other methods. Especially in the second row, where HDR-Scaffold clearly outperforms the proposed method in terms of reconstruction quality (e.g., the grass next to the statue).
- Training ELD/PMN with iPhone X Noise Model: I don't think bit depth is the primary cause of the performance degradation. Moreover, quantizing from 14-bit to 12-bit can be achieved in a very simple way.
- Something about the analysis: as the authors mentioned that "As N decreases, the bias increases, necessitating a denoiser to prevent 3DGS points from overfitting to noisy training views", then why do all the denoise+3DGS methods still suffer from color shift (mainly caused by the bias), as shown in Fig. R1? Besides, the analysis only mentions that denoising is necessary. Could you clarify the connection between the analysis and the proposed self-supervised method?
Dear Reviewer m8Xp,
Thank you for your follow-up questions, allowing us to clarify further.
Comparison with RawNeRF in the Training Views:
Why RawNeRF Results Are Not in Fig. R1
- The extensive training time required for RawNeRF (approximately a week per scene) limited our ability to complete all desired scenes within the rebuttal period. We used all three A100s to train the testing scenes in the RawNeRF dataset and report the results in Table R2. As a result, Fig. R1 could not include additional RawNeRF comparisons for those specific viewpoints.
- The notable color bias observed between the RawNeRF visualizations and ours in Fig. R1 might be attributed to differences in white balance settings. In RawNeRF's test sets, the output colors are finely adjusted using reference images. However, we cannot adjust the colors in Fig. R1 in such a way, since this is not feasible for training views due to the absence of comparable reference data.
Concerning Full-view Results Comparison:
- Our study primarily focuses on few-shot scenarios, yet we still provide full-view results for completeness. Our RawNeRF is trained using exactly the official code. The visualization of our result is similar to that in the RawNeRF paper (the third row of RawNeRF Fig. A1), e.g., in the appearance of the artifacts. Our implementation achieves almost the same results in the rendered RGB space, as shown below. Therefore, we believe our implementation is correct given the above considerations.
| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| RawNeRF in Paper (Table 1) | 23.53 | 0.536 | 0.501 |
| Ours (Table R2 in Rebuttal) | 23.53 | 0.538 | 0.500 |
- As depicted in Fig. R2, our results achieve comparable reconstruction quality to RawNeRF. However, a significant enhancement in rendering speed from 0.1 FPS to 134 FPS underscores the practical improvements our method offers. Moreover, for consistent assessment across all methods, the same color correction applied in RawNeRF is used for quantitative and qualitative comparisons in our main paper.
Trade-off Between Preserving Sharp Details and Reducing Artifacts, Enhancing Rendering Speed, and Boosting PSNR
We want to clarify that our method shows significant improvements, particularly in scenarios with limited training views, as you noted. We have effectively reduced artifacts in such cases, as seen in the first row of Fig. R1. However, in full-view settings, some blurring can occur, as mentioned in our manuscript's limitations. This is a trade-off between preserving sharp details and reducing artifacts, boosting rendering speed, and improving PSNR. For instance, in the second row of Fig. R1, the rendering speed of HDR Scaffold-GS versus our method is 52 FPS versus 96 FPS, respectively. In the testing views, we increase the PSNR of HDR Scaffold-GS from 58.08dB to 59.49dB.
ELD/PMN Domain Gap
As mentioned in our rebuttal, the domain gap is not only due to bit depth differences but also other sensor characteristics, such as the color response curve. This color bias is also noted in the RawNeRF paper:
"Since each method was trained on raw data from a different source, they impart different color tints to the output ... we calculate a per-color-channel affine transform that best matches each method’s raw output to the ground truth raw image."
In training scenes, we do not have the ground truth to perform a per-color-channel affine transform, which may result in the color bias observed in our response figures.
The Analysis
The reason to introduce the denoising prior when N is small is to reduce high-frequency artifacts, e.g., the needle-like artifacts in HDR Scaffold-GS. As shown in the figures, introducing a denoiser effectively reduces these artifacts. The color bias is mainly due to the DC component, i.e., the bias you mentioned.
As for why we use a self-supervised manner, the rationale is as follows:
- Self-supervised denoising is a meaningful topic that many researchers are working on. It has unique advantages in cases where the clean image domain is unavailable, e.g., 4D Gaussian modeling of dynamic scenes in the dark.
- While the noise model in raw space is relatively clear, a noise prior alone cannot recover the clean image/scene. Similar to the deep image prior (DIP), we are the first to find that 3DGS also has a similar 3D scene prior, as shown in our analysis.
- While using 3DGS alone can somewhat denoise the scene by controlling the number of iterations, as in Fig. 3, it risks overfitting to the noise. We are also the first to combine the noise prior and the 3DGS prior in a self-supervised manner. In this way, we can avoid overfitting to the noise and achieve better results, e.g., superior PSNR and FPS.
We hope these address your concerns. Please feel free to reach out with further questions.
Deep image prior, CVPR 2018
Thanks for the authors' further reply. But I still want to maintain my initial score, because I think there might be a "weak evaluation" of the qualitative results.
- Regarding the color shift in Fig. R1: Indeed, RawNeRF applied per-channel rescaling to the output for a fair comparison. As RawNeRF pointed out, methods that train from "different sources (other than iPhone X)" tend to exhibit "color tints." For RawNeRF, the per-channel scaling was intended to address the "Camera Shutter Speed Miscalibration" described in section D.3 of the RawNeRF supplementary materials, since RawNeRF was trained using the "same source (iPhone X)". I would like to emphasize that the output of RawNeRF is almost identical in color to what is shown in Fig. 12(a) of their paper, and it does not exhibit the strong color shift presented in Fig. R1 provided by the authors. On the other hand, I want to point out that if the authors' proposed self-supervised training method were truly effective, the results should be similar to those in Fig. 12(a) of the RawNeRF supplementary materials, given that it was also trained using the "same source" (during the reconstruction phase, not the pretraining phase). Therefore, I believe the paper suffers from a "weak evaluation".
- Regarding the detail loss in Fig. R1: There is significant detail loss, particularly in areas like the grass next to the statue in the second row. I do not believe this issue is due to a "Trade-off Between Preserving Sharp Details and Reducing Artifacts." Instead, I think it stems from the loss function proposed by the authors, which caused the denoiser to over-denoise the scene, resulting in oversmoothing (which also reduces artifacts). This assertion is based on the observation that other denoising methods, such as BM3D and Neighbor2Neighbor, retain more details in the "grass", indicating that a sufficiently good denoiser can preserve details.
During the upcoming reviewer-AC discussion phase, I may consider increasing my score, as I am currently the lowest scorer and also the most divergent one. However, for now, I will maintain my stance, as I believe the paper's evaluation is insufficient.
Dear Reviewer m8Xp,
We sincerely thank you for your review and feedback.
Color bias
We believe that the ISP used in RawNeRF for the training views is different from the one we used; this is likely the cause of the color bias. The detailed reasons are as follows:
- Even the visualizations of the input images are different from the ones in RawNeRF, as shown in Fig. 9 of their arXiv paper.
- We observed the same color bias as shown in our Fig. R1 input images using the provided ISP without color correction, the default viewer of Windows, the default viewer of macOS, and rawpy with in-camera white balance.
- We achieved different and inaccurate white balances using rawpy with its auto white balance.
Since the ISPs for the training views are not released (the provided ISP for testing needs GT images for color correction, as you mentioned), we would be grateful for any suggestion on how to obtain the same results as in their paper, and we will update the results in the revised version.
Even if there are differences in the ISP used for the training views, we use the same provided ISP for the test views in the test scenes, so none of the results in the paper have this issue.
Loss of details
- We acknowledge that some details are lost in the grass of the second row. However, as seen in other figures (including those in the manuscript and rebuttal PDF, such as the spokes of the bicycle wheel and the text on the board), we generally preserve details well.
- The loss of detail in the grass in the second row of Fig. R1 might be coincidental. If it were due to an over-smoothing nature of our method, this effect would likely be visible in all other figures as well.
- However, our method achieves better PSNR, SSIM, and LPIPS compared with HDR Scaffold-GS. The quantitative results demonstrate that our method achieves overall better detail preservation, and over-smoothing is not a common behavior of our method.
- By removing the needle-like artifacts, we can significantly improve the rendering FPS, e.g., from 53 FPS to 134 FPS even for full-view training in Fig. R2, and a ~4x FPS increase for 4-view training.
We hope this clarifies the unique contributions and impact of our work.
Best regards,
Authors
Thanks to All the Reviewers for the Insightful Comments
We would like to thank the reviewers for their efforts and insightful comments. We appreciate the reviewers’ acknowledgment of the novelty, performance, and presentation of our proposed method. For example:
- Reviewer JuHQ noted that "the paper is well written, the methodology is well motivated, and technically sound."
- Reviewer MTeY highlighted that "the paper includes extensive comparison ... and achieves supreme results compared to these baselines" and that "the whole structure is well-organized and complete."
- Reviewer h6sN pointed out the "better reconstruction and denoising performance compared to 3DGS" and praised the "good writing and presentation."
- Reviewer ZDPY appreciated the "detailed analysis of how noise impacts the optimization of 3DGS" and remarked that "the paper is well-written."
The questions or weaknesses mentioned by each reviewer are addressed separately in our responses. Please feel free to discuss with us if you have any further concerns or questions.
Following the discussion stage, the paper received four positive scores and one negative score. Several concerns were raised in the reviews, primarily related to the experiments. These concerns included the limited number of testing scenes, the lack of comparisons with RawNeRF/3DGS, the need for retraining the state-of-the-art denoisers, and the results from sparse views. The rebuttal addressed these issues by providing additional experiments and results. Other concerns included the impact of the noise extractor, assumptions about diffuse reflection, and the complexity of the calibration process, all of which were also addressed in the rebuttal.
A significant point of discussion was the comparison with RawNeRF. While the performance is only on par with RawNeRF, it might not be crucial for the proposed method to surpass RawNeRF, as NeRF and Gaussian splatting are fundamentally different approaches. Reviewer m8Xp expressed concerns regarding the grass region in the rebuttal PDF and noted color-shift artifacts. While these artifacts are present and could be improved, the proposed method remains valuable, especially considering that Gaussian splatting has recently gained prominence. The method enhances Gaussian splatting when using RAW images for dark scenes, which could benefit this emerging area.
Please incorporate the experiments and discussions from the rebuttal into the revised paper.