PaperHub
6.5 / 10
Poster · 4 reviewers
Ratings: 7, 6, 6, 7 (min 6, max 7, std 0.5)
Confidence: 4.5
Correctness: 3.0
Contribution: 3.3
Presentation: 3.3
NeurIPS 2024

MotionGS: Exploring Explicit Motion Guidance for Deformable 3D Gaussian Splatting

OpenReview · PDF
Submitted: 2024-05-09 · Updated: 2024-11-06

Abstract

Keywords
3D Gaussian, Novel View Synthesis, Optical Flow, Dynamic Scene

Reviews and Discussion

Review
Rating: 7

The proposed method extends the idea of GaussianFlow to explicitly decouple camera motion and object motion from optical flow on input monocular video. Specifically, the iterative camera pose refinement further boosts the rendering quality and performance on various datasets. Extensive qualitative and quantitative results are shown in the paper to validate the proposed contributions.

Strengths

  1. The paper extends the idea of GaussianFlow to explicitly decouple the camera motion and object motion from optical flow, which is reasonable and the formulation of the proposed method also makes sense.
  2. Explicitly modeling camera motion in an iterative way during training is useful for monocular NVS, because the rendering error partially caused by incorrect camera pose estimation can be further reduced. Though camera pose refinement is not originally introduced by this paper, the decoupling scheme is a highlight.
  3. Ablation study (both the main paper and appendix) is detailed and inspiring, validating the main contributions claimed in the paper.
  4. Both qualitative and quantitative results look good.

Weaknesses

No prominent weakness is observed.

Questions

  1. Since the paper showcases the failure case in Fig. 13, I would like to know whether the results could be improved if the camera refinement step were removed (i.e., letting optical flow equal motion flow), considering that the camera poses in the DyNeRF dataset are fixed. (The original GaussianFlow paper minimized the difference between optical flow and Gaussian flow, whereas the proposed method minimizes the difference between Gaussian flow and motion flow, i.e., optical flow minus camera flow, given that the camera flow in DyNeRF is 0.)

Limitations

Limitations have already been included and acknowledged in the paper. The proposed method seems to be less effective on cases with little camera motion.

Author Response

Section 4 Response to Reviewer 2xXk

We thank the reviewer for the constructive assessment of our work. In the subsequent sections, we respond to each concern in detail. Please feel free to use the discussion period if you have any additional questions.

4.1 Questions

4.1.1 Failure case in DyNeRF dataset

Thank you to the reviewer for the constructive feedback. As discussed in our response to reviewer 4hTV, the failure on the DyNeRF dataset primarily stems from incorrect computation of the camera flow. More specifically, this issue arises from inaccurate depth estimation. As illustrated in Figure D of the PDF attachment, our method renders incorrect depth, leading to noticeable artifacts. Theoretically, removing the camera refinement step and directly deriving the Gaussian flow from the optical flow can avoid introducing erroneous depth. Unfortunately, after conducting experiments as suggested by the reviewer, the rendered scenes still exhibit certain artifacts, which typically appear only in the test views and not in the training views. We speculate that this issue arises because of the sparse viewpoint setup in the DyNeRF dataset, which is composed of videos captured by several fixed-position cameras rather than a single monocular video. These sparse viewpoints may lead to suboptimal Gaussian initialization and cause the model to overfit to the training views during subsequent optimization.

Nevertheless, we sincerely thank the reviewer for this insightful idea, which inspired us to think more deeply about the possible limitations of our method. For scenes with stationary cameras, using reliable depth priors to provide regularization constraints may be a potential solution to alleviate this ill-posed problem, as in the recent work MoDGS [10]. In future work, we aim to combine sparse-view 3DGS methods to enhance the quality of dynamic scene reconstruction under sparse viewpoint conditions.

Comment

I am satisfied with the response and I will keep my positive score.

Comment

Dear Reviewer 2xXk,

We sincerely appreciate your recognition of our work and the valuable feedback you provided. Your detailed comments and constructive suggestions have not only helped us improve the manuscript but also offered significant guidance for our future research.

Thank you once again for your time and effort in reviewing our work.

Best regards,

Authors of Submission 3272

Review
Rating: 6

This paper proposes using off-the-shelf 2D optical flow to supervise the deformation field for 3D Gaussian Splatting (3DGS) in dynamic scenes. The optical flow is decomposed into camera flow and motion flow. The 3DGS flow is projected into 2D to match the estimated flow. Camera pose and Gaussian parameters are optimized iteratively. Experiments demonstrate the effectiveness of the proposed method.

Strengths

1. The formulation and utilization of optical flow are reasonable, and the experiments demonstrate its effectiveness.

Weaknesses

1. The idea of using optical flow to guide dynamic 3DGS modeling is not particularly novel; it seems more like an engineering effort.
2. With the additional 2D optical flow supervision, the improvement shown in the experiments is subtle.

Questions

1. Without explicitly modeling reflections, how does it perform well on the NeRF-DS dataset, even outperforming NeRF-DS with additional modeling?
2. With additional optimizations, how much longer would the training time be compared to the baseline and other methods?

Limitations

Yes; other limitations are listed in the weaknesses.

Author Response

Section 3 Response to Reviewer pL6e

We thank the reviewer for the constructive assessment of our work. In the subsequent sections, we respond to each concern in detail. Please feel free to use the discussion period if you have any additional questions.

3.1 Weaknesses

3.1.1 Novelty

We appreciate your detailed review of our work. We speculate that there might be some misunderstandings about the novelty and contributions of our work, and we would like to provide further clarification.

Our primary contribution is the separation of camera motion and object motion in optical flow, which provides clear motion guidance for dynamic Gaussians. Specifically, our method begins by acquiring an optical flow prior through an off-the-shelf optical flow estimation network. It then calculates the optical flow caused solely by camera motion by integrating depth and camera pose. Finally, it isolates the optical flow attributed to object motion to constrain the deformation of 3D Gaussians. This approach effectively creates a precise correspondence between the 2D motion prior and the 3D Gaussian motion.
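To make the decoupling concrete, the sketch below shows the per-pixel computation under common pinhole assumptions: the camera flow is obtained by back-projecting pixels with the rendered depth, transforming them with the relative camera pose, and re-projecting them, and the motion flow is the estimated optical flow minus this camera flow. The function names and conventions are illustrative, not our actual implementation.

```python
import numpy as np

def camera_flow(depth, K, R, t):
    """2D flow induced purely by camera motion between frames t and t+1.

    depth : (H, W) depth rendered from the canonical 3D Gaussians at frame t
    K     : (3, 3) pinhole intrinsics
    R, t  : (3, 3) rotation and (3,) translation of the relative camera pose
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))                       # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T    # homogeneous pixels

    # Back-project pixels of frame t into 3D camera space using the rendered depth.
    cam_pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)

    # Transform into the camera frame of t+1 and re-project onto the image plane.
    proj = K @ (R @ cam_pts + t[:, None])
    uv_next = (proj[:2] / proj[2:]).T.reshape(H, W, 2)

    # Displacement of each pixel caused only by the camera motion.
    return uv_next - np.stack([u, v], axis=-1)

def motion_flow(optical_flow, depth, K, R, t):
    """Object-motion component: estimated optical flow minus the camera flow."""
    return optical_flow - camera_flow(depth, K, R, t)
```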

Therefore, it is not entirely accurate to characterize our work as merely applying optical flow supervision to Gaussian motion. As shown in Table 3 of our main paper, directly constraining the Gaussians and deformation fields with optical flow even results in performance degradation. We believe this is because the camera motion incorrectly influences the deformation of dynamic Gaussians, while our method effectively addresses this issue.

Previous research on motion in the NeRF and 3DGS fields, such as [7] and [8], has not addressed the interplay between camera and object motion from the perspective of optical flow formation. In contrast, our method is the first to propose decoupling optical flow for motion guidance, presenting a novel solution for dynamic scene rendering.

3.1.2 Performance gains

We appreciate the reviewer's examination of our work. We would like to provide a detailed clarification and further explanation regarding our performance contributions. Previous dynamic Gaussian methods often struggle to achieve accurate reconstruction in dynamic scenes with complex motion and imprecise poses, sometimes even failing to recover the structure of dynamic objects, as illustrated in Figure 6 of our main paper. By incorporating both motion guidance and camera pose optimization, our approach achieves more accurate reconstruction in complex dynamic scenes, as demonstrated in Tables 1 and 2 of our main paper (with a 0.93 dB mean PSNR increase on NeRF-DS and a 2.3 dB increase on HyperNeRF). The ablation study also indicates that our method significantly improves the rendering quality in scenarios with complex motion and inaccurate poses.

3.2 Questions

3.2.1 Modeling reflections

We appreciate the reviewer's questions. Indeed, we do not explicitly model reflections like NeRF-DS does. However, our approach has two advantages over NeRF-DS:

  • Effectiveness of the deformable 3DGS Framework:

The 3DGS technique represents the scene as a set of anisotropic 3D Gaussians and employs an efficient differentiable rasterizer for rendering, achieving high-quality and real-time results [9]. On multiple datasets, 3DGS has matched or surpassed the performance of previous state-of-the-art NeRF methods. Our baseline method extends the rendering quality advantages of 3DGS to dynamic scenes by incorporating deformation fields into the 3DGS framework. As shown in Table 1, even our baseline achieves rendering performance comparable to NeRF-DS without explicitly modeling reflections. The anisotropic 3D Gaussians and the deformation field can model complex geometric details over time, enabling the baseline method to deliver high-quality rendering in dynamic scenes.

  • Effectiveness of Motion Guidance:

While the baseline method delivers satisfactory rendering in most cases, there are still limitations in some challenging dynamic regions (e.g., the plate in Figure 5 of our main paper). Our method enhances performance in dynamic scenes by providing reliable motion constraints to the Gaussian deformation. By accurately separating and constraining different motion components in dynamic scenes, our method performs exceptionally well even in foreground areas with complex textures and reflections.

3.2.2 Training time

We appreciate the reviewer's question. We recognize that training time is a crucial factor in assessing the practicality of a method. Below, we provide detailed data of training times for our method on the NeRF-DS dataset:

Training Time on the NeRF-DS Dataset

| Training Time | As | Basin | Bell | Cup | Plate | Press | Sieve |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 1h 1m | 1h 11m | 1h 42m | 1h 3m | 1h 0m | 0h 51m | 0h 57m |
| Ours (w/o pose refinement) | 1h 8m | 1h 15m | 1h 53m | 1h 13m | 1h 6m | 0h 58m | 1h 3m |
| Ours | 1h 33m | 1h 46m | 2h 34m | 1h 34m | 1h 30m | 1h 17m | 1h 25m |
| NeRF-DS | 6h 43m | 6h 48m | 6h 49m | 6h 50m | 6h 53m | 6h 48m | 6h 47m |

Our method exhibits a slight increase in training time compared to the baseline method. This is primarily due to the inclusion of a differentiable Gaussian flow rasterizer [8] and the process of camera pose refinement. Notably, since the proposed improvements only affect the training process, our method maintains the same real-time rendering speed as our baseline during inference.

Comment

Thanks for the clarification about the contribution: the separation of camera and object motion from the estimated optical flow, especially for cases with complex motion and inaccurate poses. Given the additional analysis and training time comparison, I would like to raise the score.

Comment

Dear Reviewer pL6e,

Thank you for improving the score of our paper. We are glad that our clarification of the contribution of camera and object motion separation was helpful and appreciate your recognition of our additional analysis and training time comparison. We will make sure to present these aspects clearly in the revised manuscript to highlight the effectiveness and efficiency of our approach.

Thank you again for your feedback and for recognizing the contribution of our work.

Best regards,

Authors of Submission 3272

Review
Rating: 6

The paper proposes MotionGS, a novel deformable 3D Gaussian splatting approach. The approach initializes camera poses and 3D Gaussians with an analytic structure-from-motion method, as in 3DGS. In addition, an optical flow network is used to compute optical flow between neighboring frames. Given the initial depth of the 3D Gaussians, the paper proposes an optical flow decoupling module that separates camera and object motion. This process is optimized to obtain better camera poses and better views through 3D Gaussian splatting, and is then iterated. The paper uses Gaussian flow, which has very recently been proposed by Gao et al., 2024, but only matches the flow due to object motion. The iterative improvement of the camera poses is based on the estimation of small residuals between consecutive views. The paper reports experiments on the NeRF-DS and HyperNeRF datasets, achieving high reconstruction quality and improving on Deformable-3DGS. An ablation study is presented comparing object motion flow with optical flow without separation, as well as the effects of camera pose refinement. The supplemental material contains more details on Gaussian flow, more implementation details, visualizations, and discussions. The appendix also contains a further ablation study quantifying the influence of depth, optical flow, etc.

Strengths

Explicit motion guidance for 3D Gaussian reconstruction is a logical next step for deformable 3D Gaussians.

The paper combines and extends some very recent parallel work in optimization of camera pose estimation by Fu et al., 2024 and optical flow based Gaussian flow by Gao et al., 2024.

The paper demonstrates the successful application of the proposed MotionGS framework to non-rigid scenes recorded by video streams.

The reported reconstruction quality of MotionGS is competitive and often state-of-the-art for videos in the chosen datasets.

Weaknesses

The paper has a large range of limiting assumptions which are not always made clear. COLMAP must find an initial solution for camera pose estimation which requires enough static features in the scene. A video input stream is needed to ensure that neighboring viewpoints are very close for optical flow estimation to succeed. The 3DGS canonical representation must be found in order for depth estimation to work.

The datasets used in evaluating the proposed method can be processed with deformable 3DGS, and hence the proposed MotionGS method is limited to improving reconstruction quality. No reconstruction beyond the limitations of prior work is demonstrated, and the above limiting assumptions make this likely challenging.

The quality of camera pose estimation is not well evaluated. There is only a visualization in the appendix which is very hard to interpret. (See Fu et al. 2024 for a more meaningful evaluation).

The proposed method appears to be quite brittle, as can be seen from the ablation study in Table 5 in Appendix A.3. The method falls below the baseline by leaving out the motion mask, switching to single-image depth estimation, and even by switching to another SOTA flow estimation network.

Questions

The key to MotionGS appears to be the separation of optical flow into object flow and camera flow. The paper does not really review work in this area; would there be other ways to attribute the flow to object and camera motion?

If Deformable 3DGS fails to find a reasonable canonical representation at initialization, is there a fallback to start the iterative optimization?

On which data was the optical flow network trained, and has the optical flow network been finetuned on the data?

Would self-supervised optical flow methods, which are known to generalize better, improve the flow estimation?

Limitations

The authors provide a limitation statement in Appendix A.5 but focus only on optical flow failures. The claim of optical flow failure for static cameras is surprising; maybe it is a shortcoming of the particular flow estimator. It would be more interesting to investigate the robustness to large motions, which likely leads to severe flow failures. In addition, the limitations of structure-from-motion and depth estimation are not discussed.

Author Response

Section 2 Response to Reviewer 4hTv

We thank the reviewer for the constructive assessment of our work. In the subsequent sections, we respond to each concern in detail. Please feel free to use the discussion period if you have any additional questions.

2.1 Weaknesses

2.1.1 Limiting assumptions

Please see Section 0.1 for details.

2.1.2 Beyond the limitation of prior work

We appreciate the reviewer's careful examination. Indeed, in some challenging scenarios, our baseline method is confined to reconstructing static backgrounds and exhibits limited performance on dynamic objects. As shown in Figure 6, the baseline method fails to capture even the basic structure of the broom. The baseline method also acknowledges this limitation and attributes it to inaccuracies in camera poses. Therefore, we propose a camera pose refinement module to iteratively optimize the 3D Gaussians and camera poses. As a result, MotionGS is more capable of handling scenes with complex motions and inaccurate camera poses than our baseline method.
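The exact parameterization and schedule follow the paper; the sketch below only illustrates the general alternating pattern, with hypothetical names (`PoseResiduals`, `render_loss`) standing in for our actual implementation, and per-frame pose offsets initialized to zero so that optimization starts from the COLMAP poses.

```python
import torch

class PoseResiduals(torch.nn.Module):
    """Hypothetical per-frame SE(3) residuals on top of the COLMAP poses."""
    def __init__(self, num_frames: int):
        super().__init__()
        self.rot = torch.nn.Parameter(torch.zeros(num_frames, 3))    # axis-angle offsets
        self.trans = torch.nn.Parameter(torch.zeros(num_frames, 3))  # translation offsets

def alternating_step(render_loss, opt_gaussians, opt_poses):
    """One round of alternating optimization between Gaussians and camera poses.

    render_loss   : closure returning the differentiable rendering loss (placeholder)
    opt_gaussians : optimizer over the Gaussian / deformation-field parameters
    opt_poses     : optimizer over the PoseResiduals parameters
    """
    # Update the 3D Gaussians (and deformation field) with the current poses.
    opt_gaussians.zero_grad()
    opt_poses.zero_grad()
    render_loss().backward()
    opt_gaussians.step()

    # Recompute the loss and update only the pose residuals.
    opt_gaussians.zero_grad()
    opt_poses.zero_grad()
    render_loss().backward()
    opt_poses.step()
```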

2.1.3 The quality of camera pose estimation

Please see Section 0.3 for details.

2.1.4 Ablation study in Table 5

We appreciate the reviewer's detailed analysis of our work. Here, we provide a detailed explanation and clarification regarding the ablation study presented in Table 5:

  1. The Role of the Motion Mask:
    • The motion mask plays a critical role in our pipeline. As discussed in Section 4.2 of the main paper, our method benefits more from the motion mask than previous approaches [7]. Specifically, our method effectively utilizes the motion mask to filter out unreasonable motion flow in static areas (a minimal sketch of such a masked flow loss is given after this list). In contrast, when only optical flow supervision is used, the motion mask cannot be directly applied.
  2. Limitations of Monocular Depth Estimator:
    • Our ablation experiments show that using an off-the-shelf monocular depth estimator introduces a performance drop. We argue that this is due to the scale ambiguity of monocular depth estimators, which results in inaccurate camera flow and motion guidance. In contrast, depth maps rendered by our method provide accurate scales and better details, as shown in Figure B of the PDF attachment.
  3. Impact of Optical Flow Networks:
    • The performance drop observed when switching optical flow networks is indeed unexpected. With further analysis, we find that FlowFormer performs inadequately in some challenging scenes, especially the "plate" scene, resulting in an overall decrease in performance. Since GMFlow is trained on more datasets, it may have stronger zero-shot generalization capability than FlowFormer.
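As referenced in the first item above, here is a minimal sketch of how a motion mask can restrict the flow supervision to dynamic regions. Tensor shapes and names are illustrative, not our exact implementation.

```python
import torch

def masked_motion_flow_loss(gaussian_flow, motion_flow, motion_mask):
    """L1 loss between the rendered Gaussian flow and the decoupled motion flow,
    evaluated only on dynamic pixels selected by the motion mask.

    gaussian_flow : (B, 2, H, W) 2D flow rendered from the deformed Gaussians
    motion_flow   : (B, 2, H, W) optical flow minus camera flow
    motion_mask   : (B, 1, H, W) 1 for dynamic pixels, 0 for static background
    """
    diff = (gaussian_flow - motion_flow).abs() * motion_mask
    return diff.sum() / (2.0 * motion_mask.sum().clamp(min=1.0))
```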

2.2 Questions

2.2.1 Related work on optical flow decoupling

Please see Section 0.2 for details.

2.2.2 Fallback to start the iterative optimization

Thank you for this important question. Like our baseline method, we freeze the deformation field during the Gaussian initialization to obtain a reliable canonical representation. In most tested datasets, this canonical representation can be initialized well. Even with inaccurate camera poses, our method can render satisfactory results through iterative optimization. Currently, we have not yet explored extreme cases (e.g. COLMAP failure) or developed alternative initialization strategies to replace COLMAP.

2.2.3 Optical flow network

Thanks for your question. The GMFlow model [4] is trained on a range of datasets (KITTI, HD1K, FlyingThings3D, and Sintel). We use the pre-trained model from the original paper without further fine-tuning on our datasets.

2.2.4 Self-supervised optical flow methods

Thank you for the valuable suggestion. We used the self-supervised optical flow estimation algorithm MDFlow [5] in our ablation experiments. The results below show that the motion constraints provided by the self-supervised optical flow estimation network do not bring an effective improvement.

Novel View Synthesis Results of NeRF-DS Dataset

| Method | SSIM↑ | PSNR↑ | LPIPS↓ |
| --- | --- | --- | --- |
| Baseline | 0.8394 | 23.61 | 0.1970 |
| Ours (w/o pose refinement) | 0.8609 | 24.12 | 0.1763 |
| Self-supervised flow estimator | 0.8308 | 23.25 | 0.2137 |

We speculate that this is because self-supervised methods cannot provide sufficiently accurate optical flow priors. With the increasing amount of annotated data and the development of foundation models in recent years, the generalization ability of fully-supervised optical flow estimation methods has significantly improved, surpassing traditional self-supervised methods. Similarly, works like MonoNeRF [6], Dynpoint [3], and DynIBaR [7] also use fully-supervised optical flow networks for optical flow priors. This phenomenon also illustrates the importance of accurate motion priors: erroneous or noisy motion constraints may even have a negative effect on the optimization.

2.3 Limitations

We appreciate the reviewer's insightful feedback. We recognize that the term "instability in optical flow computation" mentioned in our paper might have introduced some confusion. Upon further analysis, we find that the fixed and sparse camera viewpoints in the DyNeRF dataset hinder accurate depth rendering, affecting subsequent camera flow calculations and leading to artifacts, as shown in Figure B of the PDF attachment. The inaccuracies in the motion flow primarily come from the inaccuracy of the camera flow, rather than a failure of the optical flow estimation itself. It is also important to clarify that the DyNeRF dataset is not continuous monocular video but rather dynamic scenes with sparse viewpoints, which poses challenges to the canonical 3D Gaussian initialization. This poor initialization prevents our method from rendering accurate depth and performing subsequent optimizations; in future work we will consider combining sparse-view 3DGS methods to improve the robustness of MotionGS.

Comment

I thank the authors for their detailed rebuttal with additional experiments, figures and tables. The performance drop with self-supervised MDFlow is very interesting. In general, I find the answers informative and helpful. My additional comments are as follows:

I think, it would be helpful to readers (especially those less familiar with 3DGS) to include the discussion in Section 0.1 into the paper.

I find the explanation of the reduced performance with monocular depth estimation both interesting and very reasonable.

Ideally, one could evaluate improvement of camera poses with synthetic data where ground truth is known but I acknowledge the difficulty as real-world challenges of using COLMAP must be included at the same time. Maybe, this can be left to future work.

In my experience, it is not uncommon for SfM to completely fail in very dynamic real-world scenes. I appreciate that imprecise pose estimates can be improved, however, depending on the imagery complete failure is a challenge. Again maybe something for future work.

Comment

Dear Reviewer 4hTv,

Thank you for your positive feedback on our rebuttal and your insightful comments. Here is our response to your additional suggestions:

  1. Including the Discussion in Section 0.1:

We appreciate your suggestion to include the discussion from Section 0.1 in the paper. We have now integrated this discussion into L211-L219 (Section 4.2) in our main paper, to provide a clearer context and aid in the understanding of our contributions.

  2. Explanation of Reduced Performance with Monocular Depth Estimation:

Thank you for your approval of our explanation. We have ensured that this explanation is well-integrated into the revised paper.

  3. Evaluating Camera Pose Improvement with Synthetic Data:

We agree that evaluating the improvement of camera poses using synthetic data, where ground truth is available, would be ideal. We appreciate your suggestion and have noted it as our future work. We have also added a brief discussion in the limitation to reflect this point.

  4. Challenges with SfM in Dynamic Real-World Scenes:

We concur with your observation regarding the challenges of structure-from-motion (SfM) in highly dynamic real-world scenes. Indeed, complete failures of SfM are a known issue in such scenarios. While our method can improve imprecise pose estimates, it cannot fully mitigate the risk of complete failure, especially under extreme conditions. We have highlighted this limitation in the revised paper.

Thank you once again for your professional and constructive comments and recognition of our work, which is crucial to further improving the quality of our work.

Best regards,

Authors of Submission 3272

Review
Rating: 7

This paper presents a novel approach to dynamic scene reconstruction by incorporating explicit motion priors into 3D Gaussian Splatting (3DGS). The proposed framework, MotionGS, introduces an optical flow decoupling module that separates camera flow and motion flow, which respectively correspond to camera movement and object motion. This separation allows for more precise motion guidance during the deformation of 3D Gaussians. Additionally, a camera pose refinement module is implemented to alternately optimize 3D Gaussians and camera poses, addressing inaccuracies in camera pose estimation. Extensive experiments demonstrate that MotionGS outperforms state-of-the-art methods on datasets such as NeRF-DS and HyperNeRF, achieving significant improvements in both qualitative and quantitative results for dynamic scene reconstruction.

Strengths

  1. The paper is written clearly: The authors present a well-structured flow from the problem definition, through the intuition behind the approach, to the implementation and analysis. Each section is closely connected to the main point, making the paper easy to follow.

  2. The proposed method is novel, intuitive, and simple to implement: MotionGS introduces a unique approach by decoupling optical flow into camera flow and motion flow, providing explicit motion guidance for 3D Gaussian Splatting. This method is innovative and straightforward, making it easy to adopt and integrate into existing systems.

  3. Experiments validate the results well, with extensive visualizations and ablation studies: The paper robustly validates the proposed method with extensive experiments on datasets like NeRF-DS and HyperNeRF. Numerous visualizations and detailed ablation studies clearly demonstrate the effectiveness of each component of MotionGS.

Weaknesses

  1. The time, memory, and storage costs are not revealed: While the proposed method is illustrated as a simple solution, it is crucial to analyze the additional computational burden it imposes. The paper does not provide an analysis of the time, memory, and storage requirements, which are important factors in evaluating the practicality of the method.

  2. The data sampling mechanism is missing: The paper lacks details on the data sampling mechanism used for training. It is unclear how image pairs are sampled—whether t and t+1 can be any two frames in the video or they must be adjacent to each other. If the former is true, it would imply that N(N−1) optical flow maps need to be calculated, which could significantly impact storage and time requirements. Providing these details is essential to understand the overall efficiency and feasibility of the approach.

Questions

If no flow net is adopted, is it possible to calculate the loss as follows?

  1. Estimate optical flow as the sum of camera flow and motion flow
  2. Use the optical flow to warp I_{t+1}, and calculate the loss with I_t.

Limitations

No negative societal impact

Author Response

Section 1 Response to Reviewer rwSL

We thank the reviewer for the constructive assessment of our work. In the subsequent sections, we respond to each concern in detail. Please feel free to use the discussion period if you have any additional questions.

1.1 Weaknesses

1.1.1 Time, memory, and storage costs

Thank you for your question. In Table 4 of Appendix A.2, we have provided a detailed breakdown of the rendering speed and storage requirements of our method on the NeRF-DS dataset. Our approach aims to introduce explicit motion constraints and pose refinement during training without bringing additional burden to the original Deformable-3DGS during inference. Thus, the rendering speed and inference time of our method are consistent with our baseline.

For model training, we list the training time per scene and peak memory usage on the NeRF-DS dataset as below, providing a comprehensive assessment of resource usage during training. Compared to our baseline, our approach incurs increased training time and peak memory usage. This is primarily due to the additional rendering of Gaussian flow and the refinement of camera poses, which are necessary for our method.

Training Time on the NeRF-DS Dataset

| Training Time | As | Basin | Bell | Cup | Plate | Press | Sieve |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 1h 1m | 1h 11m | 1h 42m | 1h 3m | 1h 0m | 0h 51m | 0h 57m |
| Ours (w/o pose refinement) | 1h 8m | 1h 15m | 1h 53m | 1h 13m | 1h 6m | 0h 58m | 1h 3m |
| Ours | 1h 33m | 1h 46m | 2h 34m | 1h 34m | 1h 30m | 1h 17m | 1h 25m |
| NeRF-DS | 6h 43m | 6h 48m | 6h 49m | 6h 50m | 6h 53m | 6h 48m | 6h 47m |

Max GPU Memory Usage on the NeRF-DS Dataset

| Max GPU Memory (GB) | As | Basin | Bell | Cup | Plate | Press | Sieve |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 15.67 | 13.61 | 15.97 | 15.29 | 9.66 | 10.65 | 12.17 |
| Ours | 16.61 | 14.52 | 17.73 | 15.7 | 10.62 | 11.62 | 12.97 |

1.1.2 Data sampling mechanism

Thanks for your question. Here is a more detailed explanation of our data sampling mechanism:

  • Data Sampling Strategy: We adopt the same data sampling strategy as the baseline method, i.e., reading image sequences in a randomly shuffled order. For an N-frame video, the frames are shuffled and then read sequentially. In each iteration, we read two frames and calculate the optical flow between them. To enhance efficiency, the second image from the last iteration is used as the first image in the current iteration. Thus, except for the first iteration, only one new image is read in each subsequent iteration. Consequently, there are (N-1) iterations per epoch, with optical flow computed once in each iteration (see the sketch after this list). This strategy balances the introduction of accurate motion priors with maintaining training efficiency.

  • Optical Flow Calculation and Storage Strategy: During the first epoch of training, we calculate the optical flow for all adjacent frame pairs, resulting in a total of (N-1) optical flow maps. In subsequent epochs, we do not reshuffle the image sequence, allowing us to reuse the optical flow maps calculated in the first epoch. This effectively eliminates the need to recompute optical flow maps in each epoch, significantly reducing computational overhead. Taking the "as" scene (846 frames) in the NeRF-DS dataset as an example, the optical flow calculation for each pair takes approximately 22 ms, resulting in a total computation time of 18.7 seconds. Each optical flow map requires around 0.99 MB of storage, resulting in a total storage requirement of 835.13 MB.
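A minimal sketch of the pairing and caching scheme described above, assuming a fixed shuffle that is reused across epochs; names are illustrative, not the actual training loader.

```python
import random

def shuffled_pairs(num_frames: int, seed: int = 0):
    """Yield (N - 1) frame pairs per epoch: frames are shuffled once and read
    sequentially, and the second frame of one iteration becomes the first frame
    of the next, so only one new image is loaded per iteration."""
    order = list(range(num_frames))
    random.Random(seed).shuffle(order)          # shuffled once; reused in later epochs
    for i in range(num_frames - 1):
        yield order[i], order[i + 1]

# Optical flow is computed once per pair during the first epoch and cached; since
# the shuffle is not repeated, later epochs reuse the same (N - 1) flow maps.
flow_cache = {}

def get_flow(pair, flow_net, frames):
    if pair not in flow_cache:
        flow_cache[pair] = flow_net(frames[pair[0]], frames[pair[1]])
    return flow_cache[pair]
```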

1.2 Questions

1.2.1 Self-supervised flow loss

We appreciate the insightful suggestion provided by the reviewer. Following the reviewer's suggestion, we conduct experiments to evaluate this approach and have confirmed its effectiveness. Specifically, we estimate the Gaussian flow corresponding to the optical flow and use it to warp the I_t frame. We then compute the photometric loss with the I_{t+1} frame. In the experiments on the NeRF-DS dataset shown below, this method outperforms our baseline but is less effective than our proposed method.
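For reference, here is a minimal PyTorch sketch of a flow-based photometric warping loss of this kind, written in the backward-warping direction (warping frame t+1 toward frame t with the t→t+1 flow, which `grid_sample` supports directly); the tensor shapes and helper name are illustrative rather than our exact implementation. The quantitative comparison follows in the table below.

```python
import torch
import torch.nn.functional as F

def warp_photometric_loss(image_t, image_t1, flow_t_to_t1):
    """Warp frame t+1 back to frame t using the predicted 2D flow and compare to frame t.

    image_t, image_t1 : (B, 3, H, W) RGB frames
    flow_t_to_t1      : (B, 2, H, W) flow mapping pixels of frame t to frame t+1
    """
    B, _, H, W = image_t.shape
    device = image_t.device
    # For each pixel of frame t, the location to sample in frame t+1.
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().unsqueeze(0) + flow_t_to_t1  # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * grid[:, 0] / (W - 1) - 1.0
    gy = 2.0 * grid[:, 1] / (H - 1) - 1.0
    sample_grid = torch.stack([gx, gy], dim=-1)                              # (B, H, W, 2)

    warped = F.grid_sample(image_t1, sample_grid, align_corners=True)
    return (warped - image_t).abs().mean()                                   # L1 photometric loss
```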

Novel View Synthesis Results of NeRF-DS Dataset

| Method | SSIM↑ | PSNR↑ | LPIPS↓ |
| --- | --- | --- | --- |
| Baseline | 0.8394 | 23.61 | 0.1970 |
| Ours (w/o pose refinement) | 0.8609 | 24.12 | 0.1763 |
| Self-supervised flow loss | 0.8474 | 23.76 | 0.1807 |

We hypothesize that the discrepancy arises because the self-supervised loss may not provide accurate supervision in areas with similar colors. Nevertheless, it is evident that employing self-supervised optical flow loss can reduce dependence on off-the-shelf optical flow estimation. We provide qualitative experimental results in Figure A of the PDF attachment, which show that this idea can effectively provide motion constraints. When an optical flow estimation network is either unavailable or inaccurate, this approach can serve as a valuable alternative to improve rendering quality.

We sincerely appreciate your valuable idea, which has greatly enhanced our understanding of the problem. We are excited to incorporate this perspective into our research.

Comment

All my concerns have been addressed well. Besides, I appreciate that the authors have evaluated the self-supervised setting. I adjust my score to 7 and suggest putting this extra experiment in the supplementary material.

Comment

Dear Reviewer rwSL,

Thank you for your recognition and adjustment of the score. We are glad that our rebuttal effectively addressed your concerns. We also appreciate your recognition of our efforts in evaluating the self-supervised setting. We have incorporated your suggestions, and the supplementary material now contains detailed descriptions, figures, and tables related to the self-supervised experiments.

Thank you again for your constructive feedback and guidance throughout the review process.

Best regards,

Authors of Submission 3272

Author Response

Section 0: Response to all reviewers

We would like to extend our heartfelt gratitude to all the reviewers for their thorough evaluation and constructive feedback on our work. Their insights have been invaluable in refining and enhancing the quality of our research. Below, we provide additional explanations and clarifications, along with a list of references cited during the rebuttal period and an attached PDF containing supplementary figures.

Section 0.1: Clarifications on Assumptions and Preconditions

To provide a clearer understanding of our approach, we would like to further explain the assumptions and preconditions underlying our method:

  1. Use of COLMAP:

    • Like many 3DGS-based methods, our approach relies on COLMAP for initial camera pose estimation, which requires the presence of sufficient static features in the scene. Typically, such static features are widely present in most real-world scenarios, especially in background areas. Moreover, even if the initial poses provided by COLMAP are not perfectly accurate (e.g., HyperNeRF dataset), our camera pose refinement module can adaptively adjust these poses, ensuring high-quality reconstruction of dynamic scenes.
  2. Use of Optical Flow Estimation Network:

    • We recognize that computing optical flow requires sufficient overlap between images. In all the tested scenes, this overlap condition is consistently met between any two frames (see data sampling mechanism in Section 1.1.2).
    • For cases where there might be significant viewpoint differences in long videos, we recommend segmenting the video and shuffling the frames within each segment to ensure sufficient overlap between adjacent frames in the shuffled sequence.
  3. Canonical Representation of 3DGS and Depth Estimation:

    • Similar to our baseline, we first initialize a canonical 3DGS without deformation fields. This is crucial for obtaining the scale-consistent depth necessary for subsequent camera flow computation. Poor Gaussian initialization might be a challenge for our method, as seen in the failure case on the DyNeRF scenes.

Section 0.2: Related Work on Optical Flow Decoupling

To offer a more comprehensive understanding of our approach, we further discuss related work on optical flow decoupling. Utilizing optical flow to provide motion priors for dynamic scenes has been explored in previous works. For instance, Dynamo-Depth [1] synthesizes optical flow from both camera motion and object motion for self-supervised depth estimation. However, methods for dynamic scene reconstruction typically do not explicitly decompose optical flow into camera and object motion. Instead, they focus on identifying pixel correspondences in 3D space, known as scene flow. For example, NSFF [2] combines the predicted scene flow with camera poses to project points onto adjacent frames to compute optical flow, and then uses optical flow priors to supervise the scene flow. Similarly, Dynpoint [3] calculates scene flow priors by estimating depth and optical flow, which are then used to constrain correspondences between adjacent frames in dynamic scenes. Unlike these NeRF-based methods, in 3DGS the correspondence between Gaussians and pixels is a complex many-to-many mapping, which prevents directly using scene flow as a constraint. To address this issue, we propose decoupling 2D motion flow from optical flow to constrain the deformation field.

Section 0.3: Evaluation of Camera Pose Quality

We would like to further discuss and elaborate on the challenge of evaluating camera pose quality in dynamic scenes. Unlike static scene datasets (e.g., Tanks & Temples) that use COLMAP to obtain the ground truth of camera poses, we assume that COLMAP may not provide accurate poses for dynamic scene datasets. Therefore, we refine the camera poses initialized by COLMAP. In this setting, we lack ground truth for a direct quantitative comparison. Despite this, Table 5 of our main paper shows that the camera poses refined by our method further boost the rendering quality, demonstrating the accuracy of the refined poses. For further clarity, additional visualizations of pose trajectories are provided in Figure C, and a video version is included in our comments to the AC.

Section 0.4: References

Here are references cited during the rebuttal:

[1] Dynamo-depth: fixing unsupervised depth estimation for dynamical scenes, NeurIPS 2023

[2] Neural scene flow fields for space-time view synthesis of dynamic scenes, CVPR 2021

[3] Dynpoint: Dynamic neural point for view synthesis, NeurIPS 2023

[4] Gmflow: Learning optical flow via global matching, CVPR 2022

[5] MDFlow: Unsupervised optical flow learning by reliable mutual knowledge distillation, TCSVT 2022

[6] MonoNeRF: learning generalizable NeRFs from monocular videos without camera poses, ICCV 2023

[7] Dynibar: Neural dynamic image-based rendering, CVPR 2023

[8] Gaussianflow: Splatting gaussian dynamics for 4d content creation, arXiv 2024

[9] 3D Gaussian Splatting for Real-Time Radiance Field Rendering, SIGGRAPH 2023

[10] MoDGS: Dynamic Gaussian Splatting from Casually-captured Monocular Videos, arXiv 2024

We appreciate the reviewers’ suggestions, which have guided us in further improving the robustness and clarity of our work. We look forward to any additional feedback and are committed to advancing this research in light of the insightful comments provided.

Thank you once again for your time and effort in reviewing our submission. We have also included an attached PDF document with additional details and supporting information for your reference.

Final Decision

After the initial stage, reviewers raised some concerns regarding the novelty, robustness, and the lack of critical performance analysis (speed, memory).

All of the aforementioned issues were successfully addressed by the authors, which led the reviewers to rate the paper unanimously above the acceptance bar.

The AC strongly encourages the authors to include the additional experiments conducted for the rebuttal in the final version of the paper. Most importantly, please include the additional memory and speed evaluation. Furthermore, please make sure to include the results of the discussion with reviewer 4hTv, whose contribution was very fruitful.