PaperHub
Score: 5.8 / 10
Poster · 4 reviewers
Ratings: 5, 6, 6, 6 (min 5, max 6, std 0.4)
Confidence: 3.5
Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.8
NeurIPS 2024

Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis

Submitted: 2024-04-24 · Updated: 2024-11-06
TL;DR

Fast training and real-time rendering for HDR view synthesis from noisy RAW images using 3DGS-based techniques.

Abstract

Keywords

3D Gaussian Splatting · RAW images · Computational Photography · Computer Vision

Reviews and Discussion

Official Review
Rating: 5

This paper proposes LE3D, an HDR 3D Gaussian Splatting method with Cone Scatter Initialization, Color MLP, and depth distortion. Specifically, this paper introduces the Cone Scatter Initialization to enrich the estimation of SfM. The Color MLP aims to represent the RAW linear color space. The goal of depth distortion and near-far regularizations is to improve the accuracy of scene structure. Experimental results show that the proposed LE3D achieves promising results.

Strengths

  1. The motivation of this paper is good. Nighttime 3DGS and HDR 3DGS are very important.
  2. The proposed method includes comprehensive experiments.
  3. The experimental settings are easy to follow.
  4. The video demo in the supplementary material is good.

Weaknesses

  1. The authors mentioned, “the SfM estimations based on nighttime images are often inaccurate.” I agree with this point. However, from my understanding, the initial point clouds can be wrong and can be further optimized by the 3DGS training stage. Therefore, I am wondering if the cone scatter initialization is really important. Could the authors provide ablation studies using the RawNeRF dataset?
  2. 3DGS highly relies on the initial point clouds. One of the examples in Fig. 1 (a) shows that the input image can be very dark. I am wondering if the authors can get the initial SfM estimations from such inputs. From my understanding, COLMAP cannot work well on these dark inputs. If so, how do the authors get the initial SfM estimations?
  3. The authors mentioned, “the SH does not adequately represent the HDR color information of RAW images due to its limited representation capacity.” Why? Could the authors provide any solid evidence?
  4. The depth distortion (Eqn. 8) is quite similar to Eqn. 15 in Mip-NeRF 360. How does the proposed depth distortion differ from the depth distortion of Mip-NeRF 360? Could the authors conduct experiments to compare the two depth distortions?
  5. The authors seem to overstate their method. Compared to RawNeRF, LE3D reduces training time to 1%. However, the fast training is not actually due to the proposed components, but rather to 3DGS itself. Compared with the original 3DGS, LE3D's training time increases from 1.05 to 1.53, roughly a 45.71% increase. Since this is the first HDR 3DGS paper, I do not want to be too strict, but I suggest the authors not overstate the advantages in terms of speed.
  6. The quantitative results of the ablation studies should be shown.

Questions

Please see the Weaknesses.

Limitations

Yes.

Author Response

Thank you for recognizing the motivation, effectiveness, and thoroughness of our experiments on LE3D. We also appreciate your valuable comments! Below, we address the specific points.

  1. Does CSI really matter? We ablated CSI in our main paper; please refer to Fig. 4 (b) in the main paper and Fig. 1 in the rebuttal PDF. Compared to LE3D, removing CSI leads to a lack of distant scene details and to a large number of Gaussians splitting at incorrect depth levels to fit the distant scene, resulting in failed structural optimization. There is also a severe decline in visual quality (please zoom in for the best view). This is because the sparse point cloud cannot cover distant regions effectively, whereas CSI enriches the initial point cloud in distant areas, enabling effective distant-scene reconstruction.

  2. Initial SfM on Dark Scenes: We follow the exact settings of RawNeRF to obtain the initial SfM using COLMAP (except that we use the PINHOLE camera model instead of OPENCV). On the RawNeRF dataset, COLMAP calibrates camera poses well but does not generate point clouds well. This lack of point-cloud initialization is also one motivation for CSI, which enriches the point cloud and extends its depth range. For other data where the scene is too dark for reliable COLMAP calibration, we brighten the scene using the DNG data to obtain corresponding JPG images and then perform COLMAP calibration.

  3. Why is SH less expressive than an MLP? As this is a common question, please refer to the third section, "What limits the final representation capability of SH on linear color space", in the global rebuttal for details.

  4. Depth Distortion: Our depth distortion map is an approximate implementation of equation (14) from Mip-NeRF 360, as we mentioned in the main paper (L197-L198). Both of our regularizations are used to improve structure reconstruction. Please refer to Table 2 in the supplementary for a quantitative comparison of the effects of both R_dist and R_nf, and to Fig. 6, 7 in the supplementary for qualitative comparisons.

  5. Overstated Performance: Thanks for pointing out this. We apologize and will adjust our wording accordingly in our next version.

  6. Quantitative Results of Ablation Studies: Please refer to Table 2 in our supplementary.
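For reference, the distortion regularizer discussed above comes from Mip-NeRF 360. Below is a minimal single-ray sketch of that loss, not the paper's exact per-pixel approximation (which is applied to Gaussian blending weights rather than NeRF samples); function and variable names are illustrative.

```python
import numpy as np

def distortion_loss(t, w):
    """Mip-NeRF 360-style distortion loss on one ray (sketch).
    t: (N+1,) sorted interval endpoints along the ray.
    w: (N,) rendering weights of the intervals."""
    m = 0.5 * (t[:-1] + t[1:])  # interval midpoints
    # Pairwise term: pulls the weight mass together along the ray.
    inter = np.sum(w[:, None] * w[None, :] * np.abs(m[:, None] - m[None, :]))
    # Per-interval term: shrinks each individually weighted interval.
    intra = np.sum(w ** 2 * (t[1:] - t[:-1])) / 3.0
    return inter + intra
```

For a single occupied interval, only the per-interval term contributes: `distortion_loss(np.array([0., 1., 2., 3.]), np.array([0., 1., 0.]))` returns 1/3.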

We hope our response addresses your concerns. You can find a reiteration of our motivation, contribution, and potential application in the first section of the global rebuttal.

Comment

Thank you for your reply. I will maintain my score.

Comment

We are sincerely grateful for your comprehensive review and the constructive feedback provided. Your acknowledgment of our paper's strengths and the thoughtful questions raised regarding its weaknesses provide valuable guidance for our revisions. We will address the concerns with additional clarity and precision in our updated version.

Official Review
Rating: 6

The paper LE3D proposes training a 3DGS representation with raw images, instead of preprocessed LDR images. This allows more accurate scene recovery in low-light environments and unlocks applications such as HDR rendering or refocusing. While there has been prior work on training neural scene representations with raw images (notably RawNeRF), this work specifically proposes to use a 3DGS representation which allows real-time rendering. To facilitate training a 3DGS representation with raw images, they make several contributions:

  1. Cone Scattering Initialization - a method to "densify" the Gaussian initialization to overcome inaccuracies in SFM
  2. Replacing the SH in 3DGS with a colour MLP
  3. Introducing space carving losses for better geometry recovery (for downstream tasks)
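To make contribution 1 concrete, here is a hedged sketch of what a cone-scatter-style initialization could look like: extra points are sampled inside the viewing cone, beyond the depth range covered by the sparse SfM cloud. This is our reading of the description, not the authors' code; all names and default parameters are illustrative.

```python
import numpy as np

def cone_scatter_init(sfm_points, cam_center, cam_dir, n_extra=1000,
                      half_angle_deg=30.0, depth_mult=5.0, seed=0):
    """Hypothetical cone-scatter initialization: augment a sparse SfM
    cloud with points scattered inside the viewing cone, at depths
    beyond the existing far bound."""
    rng = np.random.default_rng(seed)
    cam_dir = cam_dir / np.linalg.norm(cam_dir)
    # Depth of existing points along the viewing direction.
    d_max = ((sfm_points - cam_center) @ cam_dir).max()
    # Sample distances from the current far bound out to depth_mult * d_max.
    d_new = rng.uniform(d_max, depth_mult * d_max, size=n_extra)
    # Sample directions uniformly inside a cone around cam_dir.
    cos_max = np.cos(np.deg2rad(half_angle_deg))
    cos_t = rng.uniform(cos_max, 1.0, size=n_extra)
    sin_t = np.sqrt(1.0 - cos_t ** 2)
    phi = rng.uniform(0.0, 2.0 * np.pi, size=n_extra)
    # Build an orthonormal basis (u, v, cam_dir) for the cone.
    u = np.cross(cam_dir, [0.0, 0.0, 1.0])
    if np.linalg.norm(u) < 1e-6:
        u = np.cross(cam_dir, [0.0, 1.0, 0.0])
    u = u / np.linalg.norm(u)
    v = np.cross(cam_dir, u)
    dirs = (np.outer(sin_t * np.cos(phi), u)
            + np.outer(sin_t * np.sin(phi), v)
            + np.outer(cos_t, cam_dir))
    extra = cam_center + d_new[:, None] * dirs
    return np.concatenate([sfm_points, extra], axis=0)
```

The design intent, per the paper's motivation, is that distant regions get initial Gaussians without paying the cost of full dense SfM reconstruction.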

Strengths

Originality & Significance: I like that the paper brings several contributions from the NeRF literature to 3DGS. Specifically, I like the problem being attempted: HDR rendering and training with raw images is an important problem, and it is timely to attempt it with a 3DGS representation. Losses such as the distortion loss from MipNeRF360 are widely used in the NeRF literature, and I think it's valuable to have an implementation of them in the 3DGS context. Not a significant research contribution, but I also really liked the interactive viewer in the supplement; I think it would be fun to play around with the demo and see the weaknesses of the method.

Quality: I think the authors are fairly open about the limitations, for example mentioning NeRF-based methods obtain superior sRGB metrics.

Clarity: Paper is well organized.

Weaknesses

Originality: Although useful, the contribution of the work isn't huge; from Table 1, we can see RawNeRF + GS already performs quite well. I also believe the Related Works section is a bit sparse; there has been a lot of work on low-light image enhancement with NeRF, which I think is relevant (even though it's not about HDR rendering), see a couple below:

  1. Lighting up NeRF via Unsupervised Decomposition and Enhancement, ICCV 2023 - method for training NeRFs with sRGB images taken in the dark
  2. Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption, AAAI 2024 - method for training NeRFs with sRGB images taken in over/under-exposed scenarios

Quality: This is my biggest gripe with the work, some of the contributions are included without significant reasoning. Most importantly:

  1. The authors introduce the CSI and say it helps due to the misalignments in the SFM estimation. However, I don't think this claim is sufficiently well-proven. Perhaps a denser initialization helps even in cases with perfect cameras?
  2. The authors do not justify including the colour MLP, apart from saying it performs better than just using the SH (due to the linear scaling). I think this point is specifically important since I think the MLP is making the method 1.7 times slower in rendering (looking at Table 1).

For 1) I am sympathetic to the authors that this might be hard to justify, but I think 2) should be addressed (see also questions).

Clarity: There are quite a few typos/grammatical errors, I list just a few below, but I would implore the authors to use a spell check throughout the text:

L38: to the 3D world

L170: fed

L173: set

Questions

  1. Have the authors tried just increasing the rate of the Gaussian splitting instead of CSI? Have they tried anything else apart from CSI for the gaussian density problem?

  2. Have the authors tried other strategies apart from a colour MLP? Perhaps the SH are also somehow introducing some bounds/weird gradient updates. Here's a thought: in NeRF if you bound your radiance when it shouldn't be, your density would also become heavily affected by the radiance (as it seems is happening in your case without the colour MLP), perhaps it could be something similar going on? How different is the per channel scaling behaving with/without the colour MLP?

Limitations

N/A

Author Response

Thank you for recognizing our motivation and the structure of our paper, as well as your valuable comments! Below, we address the specific points.

  1. About the related work: we will add and discuss these in our next version, including more discussion of novel-view-synthesis techniques based on sRGB images (both with and without multi-exposure), covering the papers you referred to.

  2. Does RawGS (RawNeRF + 3DGS) perform well enough? We do not believe so, for the following reasons: 1) Colour reconstruction failure and more floaters remaining: as shown in Fig. 1 (left), Fig. 3, and Fig. 13-15 in our main paper and supplementary, RawGS performs worse than LE3D in terms of colour and floaters. 2) Structure reconstruction failure: as shown in Fig. 5 (c, e) and Fig. 13-15 in our main paper and supplementary, RawGS fails to reconstruct the structure (depth map). This is disastrous for downstream tasks such as refocusing, as shown in Fig. 5 (b, d) in our main paper and Fig. 9 in the supplementary. Therefore, we do not believe RawGS performs well enough, which is also the motivation for LE3D: to make 3DGS viable for real-time HDR view rendering from noisy RAW images, as well as for downstream editing tasks.

  3. Regarding CSI and dense point cloud initialization: We previously conducted experiments using dense point clouds (this is also what we tried before coming up with CSI), which indeed partially solved the failure to reconstruct distant scenes. However, it led to the following issues: 1) Longer time cost: reconstructing dense point clouds takes four to five times as long as sparse point clouds (from 6 min 23 s to 28 min 53 s on the scene 'bikes'), increasing the overall reconstruction time. 2) Redundant Gaussians at convergence: an excessive number of points leads some Gaussians to fit noise, inflating the Gaussian count. For instance, in the 'bikes' scene, the number of points increased from 4.1e5 to 5.4e5, approximately a 30% increase, resulting in longer rendering time and more storage cost. A balance between sparse and dense point cloud initialization is therefore important, which is what CSI provides. In fact, it is hard to tell the difference between the visual results of dense initialization and CSI initialization; we can provide evidence of this.

  4. Regarding the speed issue between MLP and SH: We believe this balance is worthwhile because both structure and colour are important for downstream tasks, and the speed of LE3D remains real-time. More discussion on MLP and SH can be found in the third section "What limits the final representation capability of SH on linear colour space" in the global rebuttal.

  5. Regarding increasing the splitting threshold and the idea before CSI: We experimented with increasing the splitting threshold, as shown in Fig. 1 (c) in the rebuttal PDF, but it did not effectively solve the distant-scene issue and actually increased reconstruction noise. The reason is that our dataset is forward-facing, and Gaussians tend to move parallel to the camera plane rather than perpendicular to it. Therefore, if there are not enough Gaussians for distant scenes at initialization, additional Gaussians will not appear spontaneously. We observed this phenomenon in our earlier experiments with dense point clouds: dense point clouds can reconstruct distant points, but at the cost of training time and the number of Gaussians. Our CSI effectively balances the number and distance of points.

  6. Regarding colour representations besides an MLP, and the co-optimization of colour and structure: Our current choice of MLP proves sufficiently effective given its learning ability, as shown in Fig. 1 (d) and Table 2 in the rebuttal PDF. However, we agree with the reviewer that trying other strategies is interesting future work. In vanilla 3DGS (with SH), there is indeed a phenomenon where colour and structure do not optimize well together. As we mentioned in the ablation study (Fig. 4 (e) in the main paper), SH in the early stages leads to colour optimization failure, which in turn causes structure optimization failure before 15,000 iterations (during which Gaussians primarily optimize structure through split/clone). This, in part, limits the final colour performance of SH. Please refer to the third section, "What limits the final representation capability of SH on linear color space", in the global rebuttal for details. Besides, the per-channel scaling behavior with/without the colour MLP can also be found in Fig. 1 (d) in the rebuttal PDF.
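For concreteness, here is a toy numpy sketch of the kind of per-Gaussian colour MLP discussed in this thread; the sizes, feature dimensionality, and exponential output activation are our assumptions for illustration, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(feat_dim=12, hidden=32):
    """Tiny 2-layer colour MLP (hypothetical sizes)."""
    return {
        "W1": rng.normal(0.0, 0.1, (feat_dim + 3, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, 3)),
        "b2": np.zeros(3),
    }

def color_mlp(params, feat, view_dir):
    """Map a per-Gaussian feature plus view direction to linear RAW RGB.
    The exp output keeps colours non-negative and unbounded (HDR-friendly),
    unlike clipped SH radiance."""
    x = np.concatenate([feat, view_dir / np.linalg.norm(view_dir)])
    h = np.maximum(params["W1"].T @ x + params["b1"], 0.0)  # ReLU hidden layer
    return np.exp(params["W2"].T @ h + params["b2"])        # strictly positive
```

The design point being debated is the trade-off: a shared MLP adds a per-ray (or per-Gaussian) network evaluation at render time, but removes SH's bounded, possibly-negative radiance in linear colour space.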

We hope our response addresses your concerns. You can find a reiteration of our motivation, contribution, and potential application in the first section of the global rebuttal.

Comment

I thank the authors very much for their response.

2) Thank you for stressing the refocusing results, I think these indeed show that RawGS just can't perform some tasks, despite not performing badly on NVS metrics.

3, 5) I think the author's response and Fig 1. in the rebuttal adequately show the importance of CSI. I would implore the authors to add this discussion to the paper/supplement.

4, 6) I think the authors misunderstood me slightly: I'm not saying the MLP is unnecessary; I just wanted to know what other alternatives the authors have tried. I think this is important since, as I mentioned, the MLP brings about a 1.7x reduction in rendering speed, and speed is the main point of the paper.

I appreciate the figures the authors have included in the general rebuttal about the statistics of the SH-derived colors. They raise the question: what if an activation were applied on top of the radiance predicted by the SH? RawNeRF uses an exponential activation to increase the dynamic range. I think this might serve a similar purpose here, increasing the dynamic range and also making the outputs non-negative. Does that make sense?

I think I'm replying quite late, and the authors might not have enough time to address this, apologies for this. Although I would love to see this experiment before the end of the discussion phase, I understand this might not be enough time and would then love to see it for the camera ready.

Comment

Thank you so much for acknowledging our response and for the helpful suggestions regarding the SH with exponential activation experiments. We'd like to clarify a few points:

  1. Apologies for the confusion on Question 2: We're sorry for any misunderstanding earlier. We believe that the MLP greatly enhances performance, and the benefits outweigh the extra computational costs. The improvements in color and structure are crucial for LE3D's flexibility in downstream tasks and for better visual results. Importantly, the MLP version of LE3D still operates in real-time, so we think switching from SH to MLP is a good trade-off.
  2. SH with Exponential Activation Additional Experiment: We appreciate your insightful recommendation. Early in our project, we did try combining SH with exponential activation after noticing that vanilla SH produced negative values. While this approach did help with some of the issues SH has with HDR scene reconstruction, our experiments showed that it didn’t fully address SH's inherent limitations. For example, in the 'windowlegovary' scenario, the SH+EXP method showed a maximum value above 1e3 on one channel while staying below 10 on others, highlighting SH's expressive limitations. Additionally, we observed color issues similar to those in the 'yuccatest' scenario, as shown in Fig. 7 of our supplementary material.
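As a reference for the SH + exp idea discussed above, here is a minimal sketch of degree-1 SH colour evaluation followed by an exponential activation. The basis constants and sign convention follow the common real-SH evaluation used in 3DGS-style code; this is illustrative, not the experiment's actual implementation.

```python
import numpy as np

C0 = 0.28209479177387814  # real-SH constants for degrees 0 and 1
C1 = 0.4886025119029199

def sh_exp_color(sh, d):
    """Degree-1 SH radiance with an exponential output activation
    (RawNeRF-style), as suggested by the reviewer. Illustrative only.
    sh: (4, 3) per-channel SH coefficients; d: unit view direction."""
    x, y, z = d
    raw = (C0 * sh[0]
           - C1 * y * sh[1]
           + C1 * z * sh[2]
           - C1 * x * sh[3])   # plain SH output: may be negative
    return np.exp(raw)         # non-negative, unbounded dynamic range
```

Per the authors' observation above, even with the exp activation one channel in 'windowlegovary' saturated above 1e3 while others stayed below 10, which they cite as evidence of SH's limited expressiveness in linear colour space.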

We’re planning to include these discussions, additional figures, and a more detailed response (regarding CSI and the MLP/SH comparison) in our next version. We really appreciate your feedback and acknowledgment.

Please feel free to reach out with any more questions or requests for additional experiments. If our clarifications have addressed your concerns, we’d be very grateful if you could consider adjusting your score to reflect the improvements and the efforts we've put into refining our paper.

Comment

Thank you very much for the response. I am satisfied with the discussion provided by the authors. I would love to see some quantitative/visual results for the above (SH + exp) in the supplement, which I think would be interesting. Otherwise, I am happy to increase my score.

Comment

We are grateful for your satisfaction with our discussion and appreciate your willingness to increase the score. We will ensure to include the quantitative and visual results for SH + exp in the supplement as suggested. Thank you once again for your comprehensive review and the time you have dedicated to evaluating our work.

Official Review
Rating: 6

LE3D is a novel method for real-time novel view synthesis, HDR rendering, refocusing, and tone-mapping changes from RAW images, especially for nighttime scenes. It addresses the limitations of previous volumetric rendering methods by introducing Cone Scatter Initialization to improve SfM-estimated point clouds, replacing SH with a Color MLP to represent color in the RAW linear color space, and introducing depth distortion and near-far regularizations to enhance scene-structure accuracy. LE3D significantly reduces training time and increases rendering speed compared to previous methods. The model is tested on the benchmark dataset of the existing RawNeRF paper.

Strengths

  1. The paper is well written and has a clear structure that aligns with the contributions, which makes it easy to read and understand.
  2. The methods section is technically sound, and its details are well described.

Weaknesses

  1. Instead of using Spherical Harmonics, the authors propose a view-dependent MLP, explaining that the MLP can better represent the linear RGB space. However, it is unclear to me why SH is less expressive than a tiny MLP in the linear RGB space. Can the authors provide more insights on this?
  2. Insufficient quantitative investigation of the new view-dependent MLP. I'd expect more evaluation of this part, e.g., varying the size of the MLP and the parameters of the SH model. Moreover, it would be good to see the effect of switching to MLPs on render time.
  3. While the cone scatter initialization leads to better rendering, it would be good to discuss how the additional sampled points influence the total number of Gaussians. This number is relevant for the FPS measurements and might reduce speed.

Questions

Please discuss the points mentioned in the weaknesses.

Limitations

Sufficiently discussed.

Author Response

Thank you for your recognition of our writing and your valuable comments! Below, we address the specific points.

  1. Request for more insights on MLP and SH: We have analyzed the instability factors of SH during training and provided more statistical data to demonstrate its limited representation capability under high contrast conditions. Please refer to the third section "What limits the final representation capability of SH on linear color space" in the global rebuttal for details.

  2. Request for more ablation studies on MLP and SH: We have conducted experiments on the size of MLP and the degree of SH. Please see Table 2 in the rebuttal PDF. The experiment demonstrates that modifying the parameters of SH does not address the inherent issues, which underperforms MLP across all metrics. Moreover, adjusting the parameters of the MLP has minimal impact, given the small disparity in the metrics. Consequently, transitioning to an MLP is entirely justified.

  3. Detailed results on Gaussian counts with/without CSI: In some scenes (especially 'bikes' and 'windowlegovary'), the sparse point cloud fails to reconstruct distant points accurately. During training, Gaussians tend not to move perpendicular to the camera plane but rather parallel to it. This results in many foreground Gaussians attempting to fit the background from different perspectives, as shown in Fig. 4 (b) in our main paper, leading to more foreground split/clone operations, thereby increasing the number of Gaussians and worsening structural reconstruction. CSI adds sparse points for distant scenes, and thanks to the optimization capabilities of 3DGS, in all scenarios CSI not only did not increase the number of points but actually reduced it. CSI is particularly effective in outdoor scenes (-24.86% for 'bikes' and -13.10% for 'windowlegovary'). Details can be found in Table 1 in the rebuttal PDF.

We hope our response addresses your concerns. We would like to reiterate the potential impact of LE3D: real-time rendering for HDR view synthesis is important because it can support more downstream editing tasks, allowing novel view synthesis technology to be applied to VR/AR or to post-production in film and television (reframing and postprocessing). As demonstrated in our demo video, our current progress already supports various subsequent processing techniques in near real-time. These techniques bring LE3D closer to practical applications. More details about our motivation can be found in the first section of the global rebuttal.

Comment

Dear authors,

Thank you, I appreciate your efforts in answering my concerns. The explanation regarding the SH and MLP representation makes sense to me, and it supports the respective claim in the paper. Further, I want to thank you for the extensive ablation study on the appearance representation and the CSI. After reading all reviews and the rebuttal, I have increased my rating to weak accept and encourage the authors to add the additional ablation studies to the paper/supplementary.

Best,
Reviewer PtE5

Comment

Thanks for your thoughtful feedback and for adjusting the rating in light of our responses. We are encouraged by your support and will certainly incorporate the additional ablation studies into our paper or supplementary material as suggested. Thank you once again for your constructive criticism and valuable feedback.

Comment

Dear Reviewer PtE5,

We hope this message finds you well. As we approach the culmination of the discussion period, we would like to extend our heartfelt thanks for your insightful feedback and the time you have invested in evaluating our work.

We are keen to ensure that all your queries and concerns have been satisfactorily addressed. Should there be any remaining points that you believe require further clarification or discussion, we kindly urge you to share them with us at your earliest convenience.

It is our sincere hope that our responses thus far have successfully clarified the issues you raised. If you find that your concerns have been resolved, we would be immensely grateful if you would consider adjusting your score to reflect the improvements and the efforts we have made to refine our paper.

Thank you once again for your time and consideration.

Warm regards, The Authors

Official Review
Rating: 6

The authors aim to leverage 3D Gaussian Splatting with a few additions and changes in order to perform HDR view synthesis. The authors propose that with the addition of cone-scattering to the Structure from Motion initialization, replacement of Spherical Harmonics with a simple MLP for color representation, and extra distance regularizations, an HDR scene may be very quickly optimized to accurately render novel views in real-time. The approach is evaluated on the data used by RawNeRF (CVPR 2022).

Strengths

The paper proposes a reasonable set of additions in order to adapt 3DGS for the purpose of HDR scene reconstruction and view synthesis. Though the submission emerges alongside other comparably-aimed papers, the approach proposed is distinct and its authors provide sufficient evidence of the efficacy of their method.

  • Originality It is probably not a very distant leap to apply 3DGS to avenues of work done on NeRF, though this work does go further to involve a few novel additions to the method being adapted, in order to address shortcomings that would be present otherwise. Other works with the same aim appear to be emerging at this time, though the approach taken by the authors, affecting the gaussian initialization and performing distance regularizations, appears distinct.

  • Quality With consideration to the supplementary materials, the submission adequately covers most points of concern. Claims made are supported by the provided experimental results. Included ablation study results show qualitative and quantitative contribution for each improvement proposed by the authors.

Weaknesses

  • Clarity For the most part, the writing is clear and straightforward. The context for previous efforts improved upon and problems being solved is well provided. Relevant equations are present and adequately described. Implementation and environment details are well expanded upon in the supplementary materials. However, there are a small number of typos remaining in the submission and supplement. Please make sure to correct these.

  • Significance The paper is one of several that aim to introduce HDR Gaussian Splatting; with regard to other work released within this month and the last, no code has been released as of yet, so no direct comparisons between results can be made. However, the stated improvements relative to these other works appear to come out on top.

Questions

It is not explicitly written, but would it be appropriate to briefly and directly address the tradeoff between FPS/training time and accuracy in the comparison between the proposed method and the compared 3DGS methods?

Limitations

The supplementary material sufficiently addresses the limitations and potential negative societal impact of the submitted work.

Author Response

Thank you for recognizing our structural regularization and the thoroughness of our experiments! Below, we address the specific points.

  1. About the typos: thanks for pointing them out! We will definitely fix them in the next version of our paper.

  2. About the difference between LE3D and concurrent papers: we have discussed the differences between LE3D and other methods in the global rebuttal part, please refer to details there. The main differences between LE3D and other methods lie mainly in two aspects: a) LE3D can perform blind denoising on the data without the need for noise model calibration, which broadens its range of applications. b) LE3D places more emphasis on structural information, making it more suitable for downstream editing tasks.

  3. The tradeoff in FPS/training time: While our assertion may seem somewhat overstated, we confidently advocate choosing LE3D consistently, with the disclaimer that this comparison is made relative to LDR-GS, HDR-GS, and RawGS as discussed in our main paper. The main reasons are as follows: a) Visual quality: as shown in the main paper and supplementary figures, LE3D exhibits less noise than other methods (e.g., the floaters in Fig. 1, 3) and better color, as shown in Fig. 13. b) Structural information: LE3D captures structural details better than other methods, and much better than RawGS (Fig. 13, 14, 15), which benefits downstream tasks, as demonstrated by the defocus tasks shown in Fig. 5 and 9.

We hope our response addresses your concerns. You can find a reiteration of our motivation, contribution, and potential application in the first section of the global rebuttal.

Comment

Dear Reviewer Me2T,

Thank you for your recognition and for offering the first 'weak accept' for our LE3D! We will certainly include a discussion on the differences between LE3D and other concurrent works in the next version.

As the discussion period draws to a close, we would like to ensure that all your queries and concerns have been fully addressed. If there are any points that require further clarification or additional information, we are more than happy to provide it.

Should you find that our responses have satisfactorily resolved your reservations, we would be deeply grateful if you could consider adjusting your score to reflect the improvements.

Best regards, The Authors

Author Response

First and foremost, we would like to express our gratitude to all the reviewers and the Area Chair for their diligent review. We sincerely appreciate your recognition of our writing, your appreciation of the effects shown in our demo video, and your identification of typos and weaknesses in our paper. We look forward to improving LE3D with all of your help and making it a better paper, project, or maybe a product.

However, given that the confidence scores from our reviewers are predominantly 3, we wish to reiterate to the reviewers and the ACs the motivation, contribution, and potential impact of our LE3D.

Reiterating the motivation, contribution, and potential impact. We aim at real-time HDR view rendering and downstream editing tasks. With the recent advancements in 3DGS-related work, we attempt to leverage its capabilities to achieve real-time rendering and fast training. However, directly applying 3DGS techniques to noisy RAW images presents numerous challenges, such as poor SfM estimation, the limited representation capacity of SH, and inaccurate structure reconstruction. The primary contribution of LE3D is addressing these issues, enabling 3DGS to excel on noisy RAW images. It enables numerous post-processing applications, including HDR rendering, refocusing, tone-mapping changes, and so on.

Our potential impact lies in the applications of AR/VR technology and related post-production techniques in film and television (reframing and postprocessing). Additionally, it can bring traditional computational photography techniques into the realm of 3D reconstruction. As demonstrated in our demo video, our current progress already supports various subsequent processing techniques in near real-time: HDR tone mapping, refocusing, and color temperature tuning. Therefore, we believe that the potential impact of LE3D is substantial.

The Difference Between LE3D and Other Concurrent Methods. As discussed in the introduction, current HDR reconstruction methods mainly include two approaches: 1) reconstruction from multi-exposure 8-bit RGB images, and 2) direct reconstruction from noisy under/multi-exposure data. The first approach has high data requirements, necessitating changes in exposure during capture, while the second approach needs to overcome the impact of noise.

The most recent concurrent 3DGS methods on similar issues include HDR-GS[1] (RGB), HDRSplat[2], and Raw3DGS[3] (RAW). Among them, HDR-GS[1] does not start from RAW data and targets a different setting from ours. HDRSplat[2] and Raw3DGS[3] also differ substantially from ours in their algorithmic frameworks:

  1. The Necessity of a Denoising Network: Both HDRSplat[2] and Raw3DGS[3] adopt multi-stage approaches, requiring noise model calibration, denoising, and reconstruction after data collection. This necessitates extra data, training, and tedious noise model calibration when transferring their methods to other devices. In contrast, although LE3D has no dedicated denoising module, the algorithm is blind to noise, making it potentially adaptable to any device.

  2. Better structure for downstream tasks: Unlike other methods, our focus is not only on the reconstruction of visual effects (denoising) but also on the reconstruction of scene structure, thanks to our Cone Scatter Initialization (CSI) and regularizations. This makes LE3D more suitable for downstream tasks such as refocusing.

[1] Cai, Yuanhao, et al. "HDR-GS: Efficient High Dynamic Range Novel View Synthesis at 1000x Speed via Gaussian Splatting." arXiv preprint (2024).

[2] Singh, Shreyas, Aryan Garg, and Kaushik Mitra. "HDRSplat: Gaussian Splatting for High Dynamic Range 3D Scene Reconstruction from Raw Images." arXiv preprint (2024).

[3] Li, Zhihao, et al. "From Chaos to Clarity: 3DGS in the Dark." arXiv preprint (2024).

Next, we will elaborate on a common question raised by several reviewers:

What limits the final representation capability of SH in linear color space? The answer is the extremely high dynamic range of linear color space.

  1. We analyze the final convergence statistics for windowlegovary (multi-exposure training with a very high dynamic range). We calculate the mean and variance of each Gaussian's color across training views, as shown in Fig. 1 (d) in the rebuttal PDF. It can be observed that: a) The color MLP can learn a very high dynamic range in linear color space (maximum value over 200), while the numerical range of SH collapses completely (containing negative values) and fails to capture the especially high dynamic range of linear space. b) Additionally, the final SH color is significantly affected by the viewing direction (reflected in its extremely large variance), a manifestation of its instability that ultimately reduces its color representation ability in linear color space.

  2. As for the training process, SH cannot optimize the color well in the early stages of training, as shown in Fig. 4 (e) in the main paper. This also leads to poor structural optimization before 15,000 iterations (the phase in which 3DGS performs densification/cloning), as illustrated in Fig. 4 (c) in the main paper. Poor structure then prevents color optimization from recovering after 15,000 iterations (at this stage, 3DGS mainly optimizes color and no longer densifies/clones), further degrading the visual results, as shown in Fig. 1 (left) in the main paper and Fig. 7 in the supplementary.

Both of the above situations limit the representation capability of SH at final convergence. Details can be found in Table 2 of the rebuttal PDF, which shows that the MLP always outperforms SH by a large margin.
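To make the SH-vs-MLP contrast concrete, here is a minimal, hypothetical sketch (not the paper's actual implementation; all function names, dimensions, and weights are illustrative). An SH color head is an unconstrained linear combination of basis functions, so its output can go negative and swing strongly with the viewing direction, whereas a tiny MLP head with an exp() output is always positive and can span a very high dynamic range in linear space:

```python
import numpy as np

def sh_color(coeffs, view_dir):
    """Degree-1 SH color evaluation: a linear combination of the constant
    band and the three direction-dependent bands. The output is
    unconstrained, so it can be negative or vary sharply with view_dir."""
    x, y, z = view_dir
    basis = np.array([0.2820948, -0.4886025 * y, 0.4886025 * z, -0.4886025 * x])
    return basis @ coeffs  # coeffs: (4, 3) -> RGB (3,)

def mlp_color(features, view_dir, w1, w2):
    """Tiny MLP color head: ReLU hidden layer, then exp() on the output,
    so the predicted linear-space color is strictly positive and can
    represent very large HDR values without collapsing."""
    h = np.maximum(0.0, np.concatenate([features, view_dir]) @ w1)
    return np.exp(h @ w2)  # strictly positive HDR color (3,)
```

Under random coefficients the SH head readily produces negative "radiance" for some views, while the MLP head cannot, which mirrors the collapse-vs-stability behavior reported above.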

Comments

Dear Reviewers,

First and foremost, we would like to express our sincere gratitude for your recognition and constructive feedback on our paper. Your insights have been invaluable in enhancing the quality of our work.

As the discussion period is nearing its end, we would like to take this opportunity to check whether any of our responses have not adequately addressed your queries or concerns. We are more than willing to provide further clarifications or additional information to ensure that all points are thoroughly covered. We thank all reviewers who have provided timely responses.

Our goal is to make LE3D a better paper, and we sincerely hope that our revisions and responses thus far have been able to alleviate any doubts you may have had. Should you find that our answers have satisfactorily dispelled your reservations, we would be profoundly grateful if you would consider raising the score to reflect the improvements made.

Best regards, The Authors

Final Decision

The paper introduces a new framework that unifies rendering and inverse rendering tasks using a two-stream diffusion model. This approach is innovative and has potential implications for future work in image generation and editing. The paper is well-structured, with solid experimental results demonstrating the effectiveness of the model. However, several concerns were raised by the reviewers, primarily focusing on the limited dataset used for training and the lack of evaluation on real-world data, which raises questions about the model's generalization capabilities.

The paper's handling of lighting estimation was also critiqued, with reviewers noting the need for more comprehensive evaluation in this area. Additionally, the comparisons with state-of-the-art methods were considered somewhat outdated, and reviewers suggested including more recent work to strengthen the evaluation. Despite these concerns, the authors provided detailed rebuttals, addressing most issues and expressing their intention to improve the dataset in future work.

Ultimately, while the paper has some limitations, its novel approach and clear presentation make it a valuable contribution to the field. It has been recommended for poster acceptance, with the expectation that the authors will continue to refine their work and address the highlighted concerns in future iterations.