DN-4DGS: Denoised Deformable Network with Temporal-Spatial Aggregation for Dynamic Scene Rendering
Abstract
Reviews and Discussion
This paper proposes a novel framework for modeling dynamic scenes. To address the noisy 3D Gaussians in canonical space, the authors introduce a Noise Suppression Strategy (NSS) with a two-layer Deformation Field to resolve this issue. Additionally, they propose a Decoupled Temporal-Spatial Aggregation Module (DTS) to efficiently aggregate neighboring temporal features (for the first deformation) and k-nearest neighbors' spatial features (for the second deformation). Experiments on real-world datasets show substantial quality improvements compared to D-4DGS [1].
[1] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023.
Strengths
This paper aims to improve the capability of 3DGS-based dynamic reconstruction methods in modeling real-world scenes. The proposed Temporal Aggregation and Spatial Aggregation are sensible, and the detailed ablation studies in Tab. 4 demonstrate the effectiveness of each component. In terms of rendering metrics, the experimental results in this paper surpass the previous state-of-the-art 3D-GS-based methods in real-world scenes.
The design of the dual-deformation network and the experiments with strong baselines are appreciated.
Weaknesses
- I think the following papers should also be cited:
- (CVPR 2024) Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis by Zhan Li et al.
- (ICLR 2024) Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting by Zeyu Yang et al.
- Issues with the HyperNeRF dataset: I believe that the visualization comparisons on the HyperNeRF dataset should include HyperNeRF [2], 4DGS [3], and even Deformable-GS [4]. Based on my understanding of the HyperNeRF dataset, its camera poses are not accurate, which leads to a slight misalignment between the rendered images and the ground truth. Most current image quality evaluation metrics, such as PSNR and SSIM, penalize slight misalignments more heavily than blurriness. For example, in the Fig. 7 Chicken scene, I found that both D-4DGS [1] and DN-4DGS exhibit blurriness under the sparse-init setting. I am confident that HyperNeRF and Deformable-GS can render relatively sharp images, but slight misalignments with the ground truth caused by inaccurate poses give these methods much lower rendering metrics than some blurrier methods. (In Tab. 2, HyperNeRF, which has higher rendering quality than D-4DGS, has lower rendering metrics.) Therefore, I will not rely heavily on the quantitative metrics of the HyperNeRF dataset. The visual results are more important, and comparisons with more previous state-of-the-art methods are needed to demonstrate the effectiveness of the proposed method.
- Selection of the baseline: I do not think D-4DGS [1] is an appropriate baseline. The essence of why the Deformation Field (outputting a delta) works is its sufficiently smooth output. D-4DGS's HexPlane clearly disrupts this smoothness, which is why it performs poorly on the D-NeRF dataset. In real-world scenes, due to the high sensitivity of the explicit 3D Gaussian representation to inaccurate camera poses, it does not have a significant advantage over NeRF-based methods (like HyperNeRF). I fully agree with the authors' attempt to improve the modeling capability of deformation-based Gaussian splatting, but I think the authors need to explore more fundamental causes, such as the accuracy of camera poses and the negative optimization effects of D-4DGS's strong regularization of deformation in real-world scenarios. Methods like 4DGS [3] and SpaceTime-GS [5], which decouple time features into each Gaussian, can better handle the inaccuracies of camera poses in real-world scenes.
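The claim that MSE-based metrics punish misalignment more than blur can be checked with a toy experiment. The sketch below (illustrative numpy code on a synthetic stripe image, not HyperNeRF data) shows that a perfectly sharp rendering misaligned by a single pixel scores far lower in PSNR than a blurred but aligned one:

```python
import numpy as np

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((a - b) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

# Synthetic "ground truth": vertical stripes with sharp edges (8 px wide).
img = np.zeros((64, 64))
img[:, (np.arange(64) // 8) % 2 == 1] = 1.0

# Rendering A: perfectly sharp, but misaligned by 1 pixel (pose error).
shifted = np.roll(img, 1, axis=1)

# Rendering B: perfectly aligned, but blurred with a small box kernel.
kernel = np.array([0.25, 0.5, 0.25])
blurred = np.apply_along_axis(
    lambda r: np.convolve(np.pad(r, 1, mode="wrap"), kernel, mode="valid"),
    1, img)

print(round(psnr(img, shifted), 2))  # 9.03 dB  -> misalignment heavily penalized
print(round(psnr(img, blurred), 2))  # 18.06 dB -> blur rewarded by comparison
```

On natural images the gap is smaller, but the direction matches the point above: pixel-wise metrics can prefer a blurrier but better-aligned rendering.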
[2] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. arXiv preprint arXiv:2106.13228, 2021.
[3] Zeyu Yang, Hongye Yang, Zijie Pan, Xiatian Zhu, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. arXiv preprint arXiv:2310.10642, 2023.
[4] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101, 2023.
[5] Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. Spacetime gaussian feature splatting for real-time dynamic view synthesis. arXiv preprint arXiv:2312.16812, 2023.
Questions
A minor question: I would like to know if the authors have tested the effectiveness of their method on the D-NeRF dataset.
Limitations
Please refer to the weaknesses.
- Thank you for your suggestions. We will cite the relevant works as recommended.
- We appreciate your understanding of HyperNeRF and the insights regarding both HyperNeRF and Deformable-GS. In response, we have conducted additional visual comparisons in Figure 2 of the attached PDF, including HyperNeRF and Deformable-GS. It is worth emphasizing that our method is also applicable to Deformable-GS. As shown in Figure 3 (b) of the attached PDF, while the canonical GS in Deformable-GS performs better than D-4DGS, it still contains noise. Consequently, we not only present results for D-4DGS + our method but also for Deformable-GS + our method. The visual results demonstrate that our approach provides notable improvements.
- Thank you for your suggestions. In fact, our goal is to design our method as a plug-and-play solution. Therefore, we have conducted experiments using both D-4DGS and Deformable-GS as baselines. Specifically, Tables 1 and 2 in the main text present results with D-4DGS as the baseline, while Table 3 in the main text includes results with Deformable-GS as the baseline. To provide further insights, we have also shown results for Deformable-GS + our method on the HyperNeRF dataset in Figure 2 of the attached PDF. Additionally, we present results for SpaceTime-GS + our method on the PlenopticVideo dataset in the table below, highlighting the effectiveness of our two-stage deformation strategy (NSS) and spatial aggregation process (DSAM).
| Method | PSNR | LPIPS |
|---|---|---|
| D-4DGS | 31.15 | 0.049 |
| D-4DGS+ours | 32.02 | 0.043 |
| spacetime-GS | 32.05 | 0.026 |
| spacetime-GS+ours | 32.42 | 0.023 |

We appreciate your suggestions on exploring new directions and acknowledge the importance of camera pose accuracy for deformation-based Gaussian splatting. We also recognize the potential negative impacts of strong deformation regularization in D-4DGS. As a result, we are actively investigating methods to optimize camera poses through network feedback and exploring new ways to better represent deformation fields.
- Our work initially focused on improving dynamic reconstruction in real-world scenarios, which is why we did not extensively test on the synthetic D-NeRF dataset. However, we appreciate your reminder, as testing on D-NeRF is indeed valuable for demonstrating the generalization and completeness of our method. Consequently, we have conducted experiments on D-NeRF as well. As illustrated in Figure 3 (a) of the attached PDF, we have visualized the canonical results for both D-4DGS and Deformable-GS. The results show that Deformable-GS has less noise compared to D-4DGS, but noise is still present in moving areas. The quantitative results on the D-NeRF dataset, shown in the table below, indicate that our method improves performance on both D-4DGS and Deformable-GS, with enhancements on D-4DGS surpassing those on Deformable-GS.

| Method | PSNR | SSIM |
| --- | --- | --- |
| D-4DGS | 34.05 | 0.9787 |
| D-4DGS+ours | 34.53 | 0.9811 |
| Deformable-GS | 39.51 | 0.9902 |
| Deformable-GS+ours | 39.87 | 0.9922 |
Thanks so much for the authors' response and the additional experimental results based on the review. According to the rebuttal results, I understand that the essence of the work is to reduce noise in the canonical space of deformation-based Gaussian splatting to improve rendering quality. However, I believe this viewpoint is not valid.
In the current deformation-based Gaussian splatting, there is a crucial detail: there is no assumption that delta=0 at t=0, meaning the canonical space does not necessarily represent the state at t=0. Therefore, the canonical space is the result of free optimization, and introducing a zero-canonical constraint could actually reduce rendering quality.
I'm not sure if the authors have conducted zero-canonical experiments. If the authors' starting point is to reduce noise in the canonical space, I believe the improvements of DN-4DGS are more due to the increase in MLP layers (equivalent to deepening the deformation field). Therefore, to prove their point, the authors should provide a comparison between the baseline method with an MLP of the same depth and the zero-canonical constraint.
Thank you for your suggestion. In fact, although our proposed method involves two deformation operations, we decrease the number of MLP layers in the design process to reduce computational complexity. To substantiate our claim, we have compiled the parameters of the deformation networks corresponding to different baselines as well as the baselines combined with our method.
| Method | Parameters |
|---|---|
| D-4DGS | 3.38M |
| D-4DGS + ours | 3.41M |
| Deformable-GS | 0.52M |
| Deformable-GS + ours | 0.39M |
- D-4DGS: Since most of the parameters in D-4DGS are composed of the HexPlane and deformation head, our method introduces an additional deformation operation, leading to a slight increase in the number of parameters compared to the baseline. However, the increase is only 0.03M parameters.
- Deformable-GS: Deformable-GS primarily consists of an MLP-based deformation network. To reduce the computational load, our method halves the number of MLP layers, resulting in fewer overall parameters compared to the original Deformable-GS.
In summary, the performance improvements brought by our method are likely due to the design itself rather than an increase in the number of parameters.
Regarding the zero-canonical constraint, we have conducted relevant experiments, and the results show that this constraint indeed leads to performance degradation. We believe this is due to two main reasons:
- The zero-canonical constraint cannot ensure delta = 0 at t = 0. Even with the zero-canonical constraint, after training, there is still significant noise in the dynamic region of the canonical Gaussians, which affects the prediction of the deformation network.
- The zero-canonical constraint increases the difficulty of training the network and may overshadow the main objectives, leading to incomplete training of the reconstruction loss and subsequently affecting the network's performance.
Thanks so much for the authors' prompt response. I still find it difficult to understand the authors' explanation, from the perspective of the canonical space, of the improvements DN-4DGS brings to deformation-based Gaussian splatting. The canonical space of Deformable-GS, as shown in the rebuttal PDF, is noticeably better than in Fig. 8, so I don't believe the improvements are due to reduced noise in the canonical space. I think they are more likely due to increased spatial and temporal smoothness, which is the essential reason deformation-based Gaussian splatting works.
As demonstrated in the rebuttal, there is a significant gap between D-4DGS and Deformable-GS on the D-NeRF dataset (34.05 vs. 39.51), which is because the HexPlane disrupts the smoothness of the deformation in both space and time.
Regardless, while the authors' explanation for the effectiveness of their work doesn't fully satisfy me, the method itself does contribute to the field of deformation-based Gaussian splatting. Therefore, I have decided to raise my score to borderline accept.
The authors may consider exploring alternative explanations for the work's effectiveness, rather than focusing on the canonical space, in the release version.
Thank you for your prompt response and for ultimately raising your score. We also appreciate your suggestions regarding our work, and we fully acknowledge the importance of spatial and temporal smoothness in deformation-based Gaussian splatting. In fact, our method can indeed be viewed from the perspective of enhancing both spatial and temporal smoothness. The TAM in the first stage is designed to aggregate features along the temporal dimension, thereby improving temporal smoothness. Our two-stage deformation strategy is aimed at providing more accurate local spatial adjacency relationships in the second stage of spatial aggregation, which in turn enhances spatial smoothness. Finally, we sincerely appreciate your recognition of our work's contribution to the field of deformation-based Gaussian splatting. We will consider adding an explanation from the perspective of smoothness to highlight the effectiveness of our method in the release version.
The authors propose a Denoised Deformable Network to enhance the rendering performance of 4D-GS. In this network, a two-stage deformation prediction method is introduced to suppress noise and achieve better performance.
Strengths
Based on the performance on three dynamic datasets, the proposed method outperforms the compared methods.
Weaknesses
- Novelty: The authors claim that the novelty of the proposed method lies in suppressing noise to achieve more accurate deformation prediction. However, from the figures and descriptions, it appears to only apply a two-stage deformation prediction without any denoising operations.
- Lack of a detailed explanation of the method.
- Insufficient comparison with other methods and related work.
Questions
Method:
1. The description of the method is unclear. In Section 4.3.1, the feature is obtained by concatenating several tensors of different dimensions; it is not explained how these are concatenated or what the shape of the result is.
2. During optimization, why is an additional embedding included? What is its purpose? For instance, in HexPlane, pairs of 2D planes are initialized to represent specific spatial or spatiotemporal axes. How does the embedding function in this context?
- In the DSAM (section 4.3.2), the operation is similar to that in section 4.3.1. There is no direct denoising operation, only a two-stage deformation prediction.
Related Work and Comparison: Many new works on dynamic NeRF and Gaussian splatting have appeared at CVPR 2024.
Ablation study: From the methods section, it appears that the NSS includes both the TAM and DSAM. How, then, can it be part of the ablation study?
Limitations
The authors have addressed the limitations.
1. Novelty
The primary motivations of this paper are twofold:
- Noise in Canonical + Deformable Design: In the canonical + deformable design, point-to-point relationships within the canonical Gaussians are chaotic and erroneous (noisy), which can be transferred into the deformation field. To address this issue, we propose the Noise Suppression Strategy (NSS). This strategy involves two stages of deformation to mitigate the impact of noise on the deformation network. During the first stage, we supervise only the initial deformation. Once the Gaussian coordinates from the first deformation are relatively accurate, we introduce the second deformation and shift the supervision to it. The rendering quality can be progressively improved through this approach.
- Lack of Explicit Temporal-Spatial Feature Aggregation: Previous methods lacked explicit aggregation of temporal and spatial features. We address this by proposing the Decoupled Temporal-Spatial Aggregation Module (DTS), which includes:
- Temporal Aggregation Module (TAM): Handles the temporal aggregation process.
- Denoised Spatial Aggregation Module (DSAM): Manages spatial aggregation.
Given that inaccuracies in coordinate relationships can be amplified during spatial aggregation (KNN), we decouple temporal and spatial aggregation, placing them in two distinct deformation stages. Since the Gaussian coordinates obtained after the first deformation are relatively accurate, we perform spatial aggregation in the second stage to avoid the amplification of noise due to inaccurate spatial relationships. We name this module Denoised Spatial Aggregation Module (DSAM) because the Gaussian coordinates input into DSAM are more accurate (denoised), as shown in the fourth column of Figure 8 in the main text. The term "Denoised" is used to highlight the improved accuracy of the inputs rather than implying that DSAM performs denoising.
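The two-phase supervision schedule described above can be sketched as follows. This is a minimal, hypothetical illustration: the linear maps, step counts, and function names stand in for the actual TAM/DSAM networks and renderer, which the paper does not specify at this level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two deformation stages (weights are placeholders,
# not the paper's actual networks).
W1 = 0.01 * rng.normal(size=(3, 3))   # stage 1 ("TAM"-like)
W2 = 0.01 * rng.normal(size=(3, 3))   # stage 2 ("DSAM"-like)

def deform_stage1(x):
    # First deformation: canonical -> coarse (denoised) positions.
    return x + x @ W1

def deform_stage2(x):
    # Second deformation: coarse -> refined positions.
    return x + x @ W2

canonical = rng.normal(size=(100, 3))   # noisy canonical Gaussian centers
total_steps, switch_step = 1000, 500    # phase lengths are illustrative

for step in range(total_steps):
    coarse = deform_stage1(canonical)
    if step < switch_step:
        # Phase 1 (NSS): supervise only the first deformation, so the
        # coarse positions become relatively accurate.
        pred = coarse
    else:
        # Phase 2: add the second deformation and shift supervision to it;
        # stage 2 now receives the already-denoised coarse positions.
        pred = deform_stage2(coarse)
    # ...render(pred) and backpropagate the photometric loss here...
```

The key design point is that the second stage only ever sees positions that were directly supervised in phase 1, which is what "denoised input" refers to here.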
2. Method
- Concatenation in the Formula: we concatenate the features along the second dimension to obtain the combined feature, so the apparent dimension mismatch does not arise.
- Purpose of the Embedding: the primary purpose of the additional embedding is to add a coordinate-independent signal that improves quality by accurately modeling the different deformations of individual Gaussians without adjacency interference. We have also observed a similar design in a recent work, Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting (E-D3DGS) [1], presented at ECCV 2024.
- Clarification on DSAM: We apologize for any misunderstanding caused by our previous description. In fact, DSAM itself does not perform denoising but is focused on spatial feature aggregation. The name Denoised Spatial Aggregation Module (DSAM) was chosen because the Gaussian coordinates entering DSAM are more accurate (denoised) after the first deformation, as illustrated in the fourth column of Figure 8 in the main text. Hence, we prepend "Denoised" to emphasize the improved accuracy of the inputs rather than suggesting that DSAM performs denoising.
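As a concrete illustration of concatenation along the second dimension, the numpy sketch below combines features of different widths (all widths here are made up; the paper's actual dimensions are defined in its Section 4.3.1):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                   # number of Gaussians
spatial_feat = rng.normal(size=(N, 32))   # e.g. encoded coordinates (width assumed)
temporal_feat = rng.normal(size=(N, 16))  # e.g. aggregated time features (width assumed)
embed = rng.normal(size=(N, 8))           # coordinate-independent per-Gaussian embedding

# Tensors of different widths can still be concatenated along dim 1,
# as long as they share the first (per-Gaussian) dimension.
h = np.concatenate([spatial_feat, temporal_feat, embed], axis=1)
print(h.shape)  # (100, 56)
```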
3. Related Work and Comparison
Thank you for your suggestions. We will incorporate references to and discussions of the following papers in the related work section. Additionally, we have compared these approaches using the PlenopticVideo dataset.
| Method | PSNR | LPIPS |
|---|---|---|
| 4DGS[2] | 32.01 | 0.014 |
| E-D3DGS[1] | 31.42 | 0.037 |
| 4DGaussian[3] | 31.15 | 0.049 |
| 4DGaussian+ours | 32.02 | 0.043 |
| spacetime-GS[4] | 32.05 | 0.026 |
| spacetime-GS+ours | 32.42 | 0.023 |
4. Ablation study
Here's a detailed explanation of the ablation study for Table 4:
- The second row refers to using two-stage deformation operations, but both stages of feature extraction networks still utilize the ordinary deformation network.
- The third row indicates that we do not perform two-stage deformation operations. Instead, we directly replace the ordinary deformation network from the first row (baseline) with our proposed TAM.
- The fourth row means that we replace the feature extraction network used in the first stage of the two-stage deformation operations (from the second row) with our proposed TAM.
- The fifth row specifies that we do not use two-stage training. Instead, we couple temporal and spatial aggregation into a single process, training TAM and DSAM simultaneously throughout the entire process.
- The last row shows that we replace the feature extraction network used in the second stage of the two-stage deformation operations (from the fourth row) with our proposed DSAM.
[1] Bae J, Kim S, Yun Y, et al. Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting[J]. arXiv preprint arXiv:2404.03613, 2024.
[2] Zeyu Yang, Hongye Yang, Zijie Pan, Xiatian Zhu, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. arXiv preprint arXiv:2310.10642, 2023.
[3] Wu G, Yi T, Fang J, et al. 4d gaussian splatting for real-time dynamic scene rendering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 20310-20320.
[4] Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. Spacetime gaussian feature splatting for real-time dynamic view synthesis. arXiv preprint arXiv:2312.16812, 2023.
Thank you for providing a detailed explanation and additional experiments. The rebuttal addressed my concerns and clarified my questions.
In the "Related Work and Comparison" section, I noticed that although the proposed method demonstrates superior performance in terms of PSNR, the LPIPS score is not as strong. Could you please provide a more in-depth explanation for this discrepancy?
Overall, based on the rebuttal, I have decided to raise my score to borderline accept.
Thank you very much for your professional evaluation and willingness to raise your score! PSNR primarily compares the mean squared error between the original image and the reconstructed image. It's effective at capturing overall differences (For example, if the reconstructed image is missing an object or includes an extra one, the mean squared error will be higher, resulting in a lower PSNR value). However, PSNR isn't as effective at capturing fine details or distinguishing blurred images. On the other hand, LPIPS uses deep learning for comparison, which excels at extracting structural information and capturing local details in the image.
Our approach improves the deformation network by implementing a two-stage deformation process and spatiotemporal information aggregation, leading to significant enhancements in modeling dynamic regions. As shown in Figure 6 of the main text, on the PlenopticVideo dataset, our method effectively captures the motion of the coffee and flames, whereas the baseline method shows an over-reconstruction of the coffee and a failure to capture the flames. Consequently, our method shows a noticeable improvement in PSNR compared to the baseline.
However, when it comes to capturing fine texture details, we think this mainly depends on the baseline model’s design, such as the Gaussian properties, the optimization process, and the loss function. Since our method is a plug-and-play enhancement, it doesn’t modify these aspects of the baseline model. As a result, the depiction of fine texture details still relies heavily on the baseline's inherent capabilities, leading to a less noticeable improvement in the LPIPS metric.
Dear reviewer 3vN1,
We received diverse reviews for this submission. Please review the rebuttal and the comments from other reviewers and join the discussion as soon as you can. Thank you!
Your AC
This paper presents a two-stage pipeline for reconstructing 4D scenes with 3D Gaussians. Similar to previous works, the authors define a set of canonical 3D Gaussians and use MLPs as deformation fields to deform them into new frames. The key difference is that the authors notice that the deformation field contains a lot of noise, so they propose to suppress the noise via a two-stage deformation process. In the first stage, they use temporal aggregation to gather temporal information. In the second stage, spatial aggregation is performed over nearest neighbors. The training loss is the standard reconstruction loss with a TV loss. Results demonstrate that the proposed method outperforms previous methods.
Strengths
- The authors notice that a problem of current 4DGS methods is that the canonical space contains a lot of noise
- The authors propose a noise suppression strategy and temporal-spatial aggregation to remove the noise
- The authors perform extensive experiments and ablation studies to verify their idea
Weaknesses
Personally, I feel the biggest issue with this paper is that the presentation of the introduction and method is not that clear to me. Firstly, the authors claim that previous 4D Gaussian methods can have a noisy canonical space. I do not fully understand what noise means here; there should be more explanation of this. Secondly, it seems that the noise suppression strategy is just temporal aggregation followed by spatial aggregation, yet the authors claim in Line 46 that there are two components, NSS and DTS, which seem to represent the same thing. On Line 140, the authors claim that after the first deformation, the noise is attenuated. Why is that? With this being said, what is the difference between the last two rows of Table 4?
Questions
What is the training time of this method compared with other 4D Gaussian methods? Would the two-stage training and aggregation slow down training?
Considering the weaknesses I mentioned, I am happy to discuss with the authors and hope to see a better exposition of this paper, especially about the NSS and how that relates to DTS.
Limitations
The authors have discussed their limitations.
Weaknesses:
We apologize for any confusion caused by our expression. Below, we will clarify your questions.
- "Noise": In the canonical + deformable design, we input the canonical Gaussian coordinates and the time into the deformable network. The deformable network essentially performs basic operations (addition, subtraction, multiplication, division) on these coordinates and the time. Since the point-to-point relationships within the canonical Gaussians are chaotic and erroneous, as shown in Figures 2 and 5 of the main text, it is predictable that feeding these erroneous coordinates into the deformable network will transfer this error into the deformation field, introducing inaccuracies in the final deformations.
- NSS and DTS: NSS is a strategy that uses two stages of deformation to reduce the impact of noise on the deformable network. During this process, we have two training phases. In the early training phase, we only supervise the first deformation. Once the Gaussian coordinates obtained from the first deformation are relatively accurate, we add the second deformation and shift the supervision to it, as detailed in Figure 1 of the attached PDF. This design does have limitations, which can be seen in Section 5.4 of the main text. DTS is a specific feature aggregation method we propose. Unlike 4DGaussian, D-3DGS lacks explicit spatiotemporal aggregation, so we designed DTS to aggregate spatiotemporal information. Considering that inaccurate coordinate relationships would be amplified through spatial aggregation (KNN), we only perform spatial aggregation in the second stage of NSS, which is DSAM. We name the spatial aggregation module "Denoised Spatial Aggregation Module" (DSAM) because the Gaussian coordinates input into DSAM are more accurate (denoised), as shown in the fourth column of Figure 8 in the main text. Therefore, we prefix the Spatial Aggregation Module with "Denoised." DSAM itself does not have denoising capabilities; it solely performs spatial feature aggregation.
- "The authors claim that after the first deformation, the noise is attenuated. Why is that?": We emphasize that there are two training phases in the process. In the first phase, we only train the first deformation operation (TAM). Once the Gaussian coordinates obtained from the first deformation are relatively accurate, we add the second deformation (DSAM) into the network for training. Because the Gaussian coordinates and their relationships are relatively accurate after the first phase of training, we claim that after the first deformation, the noise is attenuated.
- Difference Between the Last Two Rows of Table 4. We apologize for the oversight in not clearly explaining the settings for the penultimate row. The penultimate row indicates that we did not employ two training phases; instead, we trained TAM and DSAM simultaneously throughout the entire process. The main issue with this training approach is that the Gaussian coordinates deformed by TAM are never explicitly supervised, and thus, they still contain noise. Consequently, the Gaussian coordinates input into the DSAM module remain noisy. This is reflected in the quantitative results, where the performance in the penultimate row is actually lower compared to the third row of Table 4.
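The role of accurate positions in the KNN step can be seen in a minimal sketch of spatial aggregation. This is illustrative only: the real DSAM learns its aggregation, whereas the sketch simply averages the k nearest neighbors' features.

```python
import numpy as np

def knn_aggregate(pos, feat, k=4):
    """Average each Gaussian's feature with its k nearest neighbors' features."""
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)  # (N, N) squared distances
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]                  # skip self at index 0
    return (feat + feat[nn].sum(axis=1)) / (k + 1)

rng = np.random.default_rng(0)
pos = rng.normal(size=(50, 3))    # Gaussian centers AFTER the first deformation
feat = rng.normal(size=(50, 8))   # per-Gaussian features
agg = knn_aggregate(pos, feat)    # (50, 8) spatially aggregated features
```

If `pos` were the noisy canonical centers instead, `argsort` would pick semantically wrong neighbors and the averaging would mix unrelated features, which is exactly the amplification the two-phase schedule is meant to avoid.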
Questions:
We reported the time in the "Time" columns of Table 1 and Table 2 in the main text. Indeed, this two-stage training and aggregation approach does result in some delay in training speed. However, the difference is not substantial, and the method remains competitive. The specific reason is that we reduced the number of MLP layers in the deformable network.
Thanks to the authors for the response. Now I understand this paper much better! From my understanding, the method contains two stages of training: the first stage uses only TAM, and the second stage uses both TAM and DSAM. Together, the two stages form the NSS strategy, which mitigates the noise of 4DGS. I still do not fully understand why the noise is attenuated after the first deformation (TAM). To my understanding, the difference between stage 1 and the original 4DGS is that TAM has temporal aggregation; does the noise attenuation come from that?
First of all, thank you very much for your response, and we're glad to see that our previous answers helped you better understand our work. Regarding the questions you’ve raised, here are our answers:
First, we want to emphasize that the noise reduction pertains to the input of DSAM. Therefore, the meaning of line 140 in the main text is as follows: Due to the successful alteration of the input coordinate distribution during the first deformation stage, we can obtain more accurate Gaussian positions compared to the canonical Gaussian. As a result, the noise in the input to DSAM is attenuated. Next, we will explain the reasons for this phenomenon from two aspects: primary and secondary factors.
- Two-Stage Training Method (NSS) - Primary Factor: For DSAM, if the input does not go through the first stage of deformation, it will consist of the canonical Gaussians; however, after the first stage of deformation, the input becomes the deformed Gaussians. Since the deformed Gaussians have more accurate positions and coordinate relationships, the noise in DSAM's input is effectively reduced.
- Temporal Aggregation (TAM) - Secondary Factor: Compared to an ordinary deformation network, TAM performs better in the first stage of deformation due to its ability to aggregate temporal information, as demonstrated in the table below. In this table, we present experiments where the two-stage deformation was conducted without TAM, replacing it with an ordinary deformation network. From the table, it can be seen that using TAM provides DSAM with more accurate input, leading to an improvement in the final rendering quality.

| Setting | PSNR | SSIM | LPIPS |
|---|---|---|---|
| w/ TAM | 32.02 | 0.944 | 0.043 |
| w/o TAM | 31.72 | 0.943 | 0.045 |

Therefore, for DSAM's input, if TAM is used in the first stage, the Gaussian positions and coordinate relationships will be more accurate (compared to the coordinates after deformation by an ordinary network), which also gives TAM a certain ability to reduce noise.
In the revised version, we will provide further clarification on line 140 to help with understanding.
Thanks for the explanation. It makes it much clearer to me why the proposed method can achieve noise reduction. I think the method is interesting, and I am happy to raise my score.
Thank you for your recognition and adjustment of the score. We are glad that our rebuttal effectively addressed your concerns. We also appreciate your recognition of our methods. Thank you again for your constructive feedback and guidance throughout the review process.
Dear reviewer 1vtH,
We received diverse reviews for this submission. Please review the rebuttal and the comments from other reviewers and join the discussion as soon as you can. Thank you!
Your AC
This paper proposes DN-4DGS, a deformation-based Gaussian splatting technique for dynamic scene rendering. DN-4DGS utilizes a grid-based representation to model the deformation field. It features a two-stage deformation inference pipeline that aggregates temporal and spatial information to reduce noise in the deformation field. These two stages are trained sequentially. Experimental results show that DN-4DGS outperforms other methods in rendering quality.
Strengths
The Temporal Aggregation Module and Denoised Spatial Aggregation Module leverage temporal and spatial features to regularize the deformation field. This idea is both sound and novel. Based on the reconstructed canonical Gaussians demonstrated in the paper, I find that the reconstructed deformation field may be closer to the true deformation. The authors provide clear intuitions about these designs.
Weaknesses
There are some inaccurate descriptions in the paper. In line 158, has the time parameter t. Does this mean that the position of the Gaussian is deformed to time t? In formulas (3) and (5), the function depends only on the current coordinates x, y, z and time t; however, it actually also involves temporal features and neighboring points in space, and these inputs are not described in the two formulas.
The authors claim that temporal optimization is performed first to reduce the impact of noise on the KNN in the second stage. However, the ablation study lacks an experiment that uses only DSAM without TAM, with TAM replaced by an ordinary deformation network. It is therefore difficult to verify this claim.
The authors did not provide video results.
Questions
Compared to the 3DGStream method, what are the advantages and disadvantages of DN-4DGS? Adding a discussion of the latest methods can highlight the unique features of this method.
Limitations
A potential limitation could be the inability to handle scenarios with significant movement.
Thank you for pointing that out. It was indeed a typo, and we will correct "" to "".
Thank you for your reminder. We will correct the above formulas and add these two parts.
Formula 3: , where represents the set of 's neighbors.
Formula 5: , where represents the set of neighbors of ().
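The neighbor sets referenced in these formulas can be illustrated with a small stand-alone sketch of KNN-based spatial feature aggregation. This is a simplified stand-in, not the paper's actual DSAM: the plain feature averaging, the value of k, and the brute-force distance computation are all assumptions made for clarity.

```python
import numpy as np

def knn_aggregate(positions, features, k=4):
    """For each Gaussian, average the features of its k nearest neighbors.

    positions: (N, 3) deformed Gaussian centers (stage-one output).
    features:  (N, C) per-Gaussian features.
    k: number of neighbors (the value here is an assumption).
    """
    # Pairwise squared distances; fine for small N (a KD-tree scales better).
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude the point itself
    nbrs = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbors
    return features[nbrs].mean(axis=1)    # (N, C) aggregated features

pos = np.random.rand(100, 3)
feat = np.random.rand(100, 16)
agg = knn_aggregate(pos, feat)
print(agg.shape)  # (100, 16)
```

The sketch also shows why stage-one accuracy matters: the neighbor indices come from the deformed positions, so noisy positions would select the wrong neighbor sets before any averaging takes place.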
Thank you for your suggestions. We conducted experiments where we used only DSAM without TAM, replacing TAM with an ordinary deformation network. The results can be seen in the table below. In the camera-ready version, we will incorporate this experiment into the original table to make it complete.
| Setting | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| w/ TAM | 32.02 | 0.944 | 0.043 |
| w/o TAM | 31.72 | 0.943 | 0.045 |
Video results: According to NeurIPS 2024 policy, we can only provide the video link to the Area Chair (AC). Please contact the AC for the video link.
Similarities: Both our method and 3DGStream deform 3D Gaussians (3DGs) whose absolute positions and relative positional relationships are comparatively accurate. For each timestep t, 3DGStream uses the 3DGs from the previous timestep as initialization. In contrast, our method uses canonical Gaussians (which are time-independent) as the initialization. To achieve accurate relative positional relationships and minimize noise interference from the canonical Gaussians, we employ a two-stage deformation strategy: the first stage obtains accurate 3DGs, and the second stage deforms these 3DGs further to achieve more precise rendering results.
Advantages:
- Flexibility: Unlike 3DGStream, our method does not require the results from the previous timestep. We can render at any arbitrary time without needing the previous timestep's 3DGs. On the other hand, 3DGStream relies on the 3DGs from the previous timestep for rendering the next.
- Robustness: 3DGStream heavily depends on the reconstruction quality at timestep 0. If the initial reconstruction is poor, subsequent reconstructions will be negatively affected, and these errors can accumulate over time. Our method, however, starts with noisy canonical Gaussians and improves the reconstruction quality through a two-stage deformation process, resulting in progressively better reconstructions.
Disadvantages:
- Training Efficiency: 3DGStream employs an online reconstruction approach, leading to shorter training time and faster rendering speed. In contrast, our method involves offline training, which results in relatively longer training time.
Thank you for your reminder. We will consider the application of our method in scenarios with significant movement.
Thank you to the reviewers for their valuable reviews, which will help us further improve the quality of our work. We have addressed each reviewer's questions individually within their respective sections, and we look forward to further discussions during the discussion phase. Below and in the attached PDF, we present more comprehensive understanding and visualizations of our method.
This submission received all positive reviews after a strong rebuttal. The reviewers feel the proposed method is sound and novel, and the ablation study is mostly solid. The AC recommends accepting the paper, but asks the authors to carefully address the concerns raised by the reviewers in the camera-ready version, including the clarity of the exposition, references suggested by the reviewers, and additional ablations.