PaperHub

NeurIPS 2024 · Poster

Overall rating: 5.3 / 10 (4 reviewers; scores 5, 6, 5, 5; min 5, max 6, std 0.4)
Confidence: 4.0 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.8

Fully Explicit Dynamic Gaussian Splatting

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

4D Gaussian Splatting with static & dynamic separation using an incrementally extensible, keyframe-based model

Abstract

Keywords
Gaussian Splatting · Neural Rendering · Computer Vision

Reviews and Discussion

Review (Rating: 5)

The authors propose a Fully Explicit Dynamic Gaussian Splatting method to decrease training time and memory cost without losing high fidelity. The core of the proposed approach, which is based on decomposing static and dynamic Gaussians during training, is to sample dynamic Gaussians at sparse timestamps and then interpolate to estimate the dynamic Gaussians within each time interval. In addition, a backtracking technique is used to prune useless dynamic Gaussian points. However, some questions remain unclear. I will consider increasing the rating if the authors can provide a reasonable response.

Strengths

  1. Unlike existing works, the authors attempt to select keyframes and use different interpolation techniques on the keyframes to model the dynamic Gaussians.
  2. The results on the N3V dataset show that the proposed method achieves comparable or higher performance than the existing methods, while largely reducing the memory and training time cost.
  3. The novel techniques to extract the dynamic points and model the temporal opacity seem very interesting.

Weaknesses

  1. From the results shown in Fig. 6 and the supplementary materials, the points extracted by the 'dynamic points extraction' technique actually seem to be points with color changes. I do not think these results concretely support the contribution. More experiments on scenes that include objects with large movements might be helpful.
  2. From the proposed opacity modeling technique and the limitations claimed in Sec. 6, it seems the proposed method can only work well in scenes where the objects are stable and no new objects come in. I would like to know the performance of the proposed method when new objects come in during progressive processing. The reported results do not seem to contain such scenarios.

Questions

  1. In lines 238 and 239, why are the top 2% of points converted? Why are distant points considered a potential bias?
  2. Can the authors provide more details on how the parameters of keyframes are optimized?
  3. In line 258, what does it mean by 'use only COLMAP point clouds from the first frame'?
  4. What if the movements between key frames are not regular motions?

Limitations

Please refer to 'Weaknesses'.

Comment

Table A. Quantitative results of the experiment on handling color changes without dynamic points.

| Model | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| 3DGS | 21.69 | 0.851 | 0.126 |
| 3DGS + ours dynamic | 26.07 | 0.891 | 0.089 |
| Ours | 29.03 | 0.922 | 0.068 |

Table B. Quantitative results on the Birthday scene in the Technicolor dataset.

| Technicolor Birthday (longer) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| Ours | 29.12 | 0.900 | 0.094 |

Table C. Experimental results on the conversion rates to dynamic points.

| Percent | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| 0.5 | 32.55 | 0.956 | 0.043 |
| 1 | 32.86 | 0.957 | 0.042 |
| 2 | 33.04 | 0.956 | 0.041 |
| 4 | 32.90 | 0.956 | 0.042 |
| 8 | 31.72 | 0.955 | 0.042 |

Table D. Experimental results on different keyframe intervals and skipped frames.

| Keyframe interval | Skipped frames | PSNR | SSIM | LPIPS | Size (MB) |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | 31.17 | 0.948 | 0.057 | 595 |
| 1 | 2 | 31.47 | 0.948 | 0.056 | 415 |
| 1 | 3 | 31.81 | 0.946 | 0.051 | 142 |
| 2 | 1 | 32.06 | 0.952 | 0.051 | 314 |
| 2 | 2 | 32.33 | 0.954 | 0.049 | 322 |
| 2 | 3 | 31.81 | 0.953 | 0.044 | 101 |
| 5 | 1 | 31.70 | 0.953 | 0.047 | 206 |
| 5 | 2 | 32.29 | 0.954 | 0.043 | 126 |
| 5 | 3 | 32.53 | 0.954 | 0.045 | 80 |
| 10 | 1 | 33.04 | 0.956 | 0.041 | 119 |
| 10 | 2 | 32.65 | 0.956 | 0.043 | 93 |
| 10 | 3 | 31.79 | 0.953 | 0.046 | 74 |
| 20 | 1 | 32.78 | 0.955 | 0.043 | 90 |
| 20 | 2 | 32.07 | 0.952 | 0.047 | 78 |
| 20 | 3 | 32.08 | 0.953 | 0.048 | 73 |
| 50 | 1 | 32.14 | 0.955 | 0.046 | 79 |
| 50 | 2 | 31.93 | 0.951 | 0.052 | 72 |
| 50 | 3 | 30.91 | 0.949 | 0.056 | 73 |
Author Response

[W1] The main motivation for separating dynamic points is to allow FEDGS to handle temporal changes in either color or displacement, so stationary objects with temporally changing colors are also trained as dynamic points.

We carry out an additional experiment for the case where points with color changes are regarded as static points. We train 3DGS by masking only the moving parts of N3V and measure the performance gain on the remaining parts (to measure the overall gain, we replace the dynamic parts with ours). Please refer to Table A for the detailed results. As shown in the evaluation, if we do not consider points with color changes as dynamic points, it is hard to obtain satisfactory results.

For the case of large movements, we conduct an experiment with longer frames on the Birthday scene in Figure D of the uploaded PDF file. The rendering result shows that FEDGS detects the movements of objects as well as the color changes. The detailed quantitative result for large movements is covered in the next answer.

[W2] We conduct an experiment with a different frame range of the Technicolor dataset to examine this issue. The selected interval runs from the timestamp when the person is completely invisible to the timestamp when the bag is placed on the desk. The result is reported in Table B and Figure D of the uploaded PDF file. In this case, the initial point cloud only uses the COLMAP result of the first timestamp (no prior point cloud for the person). This result shows that FEDGS can learn objects appearing in the scene. This is because the densification of 3DGS works well thanks to the explicit representation, which allows splitting from neighboring Gaussians.

[Q1] This is because the 2% of static points with the largest movement are closely related to the dynamic part. We note that the 2% is determined empirically: extracting more points produces too many dynamic points, and extracting fewer points results in too few dynamic points. The performance change according to this ratio is shown in Table C.

[Q2] The keyframe is only associated with dynamic points. The parameters optimized in each keyframe are the position and rotation values, and they are optimized by RAdam. Using Equation 7, CHIP interpolates the position values stored in the four neighboring keyframes for the dynamic points at timestamp t. The interpolated position is an R^3 vector whose gradient can be obtained by the differentiable rasterizer of 3DGS. After obtaining the gradient, by Equation 6, the gradient of each keyframe's position can also be obtained by partial differentiation (we use PyTorch's autograd) and used to update the position values stored in each keyframe. It is worth noting that, thanks to CHIP, FEDGS can also learn the velocity change and momentum of each dynamic point. Similarly, the keyframe rotation value can be optimized by Equation 8 and the differentiable rasterizer of 3DGS; the difference from the keyframe position is that it uses two neighboring keyframes.

[Q3] This means that only the images from the earliest timestamp are used for COLMAP. For the N3V dataset, for example, the COLMAP result is provided for the first frame (about 20 images) out of a total of 300 frames. Note that 4DGS models or STG use all frames (about 6,000 images). This prevents COLMAP from taking excessive computational time.

[Q4] Irregular motion can be handled by CHIP, given a small enough keyframe interval. CHIP uses the tangent and position of two neighboring keyframes. As shown in Equation 7, the difference of positions between two adjacent keyframes is used as the tangent, and the position values are explicitly stored in each keyframe. With these two values, CHIP approximates dynamic motion as a polynomial. If the keyframes are too sparse, it becomes difficult to approximate the motion with CHIP (i.e., higher-degree polynomials would be required), but if they are dense enough, most of the motion can be approximated, even if it is velocity-variant or non-linear. This can be seen in the results in Table D. The experiment is performed on the Cook Spinach scene in the N3V dataset, which includes motion that changes direction, accelerates, and decelerates. We also include experiments that skip frames to simulate larger movements (the more frames skipped, the larger the movement). These results show that if the keyframes of FEDGS are dense enough, irregular motion can be handled. In general, a keyframe interval of 10 is reasonable considering overall performance and storage size.
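
For illustration, the following is a minimal PyTorch sketch of this kind of differentiable cubic Hermite interpolation over keyframed positions, with Catmull-Rom-style tangents taken from adjacent keyframes. The tensor shapes, tangent scaling, and boundary handling are illustrative assumptions rather than the paper's implementation; the point is only to show how gradients from a rendering loss can flow back into the keyframe positions via autograd.

```python
# Minimal sketch of differentiable cubic Hermite interpolation over keyframes.
# Tangents are finite differences of neighboring keyframe positions (Catmull-Rom
# style), so gradients from the rendering loss flow back into up to four keyframes
# through autograd. Names and scaling are illustrative, not the paper's code.
import torch

def hermite_interpolate(key_pos: torch.Tensor, key_dt: float, t: float) -> torch.Tensor:
    """key_pos: (K, N, 3) learnable keyframe positions; key_dt: keyframe interval."""
    k = int(t // key_dt)                        # index of the left keyframe
    u = (t - k * key_dt) / key_dt               # normalized time in [0, 1)
    p0, p1 = key_pos[k], key_pos[k + 1]         # positions bracketing t
    # Tangents from adjacent keyframes (clamped at the sequence boundaries).
    m0 = 0.5 * (key_pos[k + 1] - key_pos[max(k - 1, 0)])
    m1 = 0.5 * (key_pos[min(k + 2, key_pos.shape[0] - 1)] - key_pos[k])
    # Cubic Hermite basis functions.
    h00 = 2 * u**3 - 3 * u**2 + 1
    h10 = u**3 - 2 * u**2 + u
    h01 = -2 * u**3 + 3 * u**2
    h11 = u**3 - u**2
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1

# Usage: the interpolated positions at timestamp t would feed the differentiable
# rasterizer; calling backward() on the rendering loss fills key_pos.grad, which an
# optimizer such as RAdam then uses to update the keyframe positions.
key_pos = torch.randn(30, 10_000, 3, requires_grad=True)  # 30 keyframes, 10k dynamic points
pos_t = hermite_interpolate(key_pos, key_dt=10.0, t=37.5)
pos_t.sum().backward()                                    # stand-in for a real rendering loss
```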

Comment

Thanks to the authors for their explanations. However, there are still some remaining questions.

  1. In your answer to [W2], I think you want to claim that your proposed method can work well in scenarios with new objects appearing. However, you also claimed in the limitations that this is one of the limitations, which confuses me. The qualitative result presented seems good, but the quantitative result is not as good as the performance of your method on the other dataset.
  2. Please clarify which data the experiments on the dynamic point ratio in Table C are conducted on.
  3. I am still concerned that the empirically selected dynamic point ratio and the intervals between keyframes might limit the generalization of the proposed method.
  4. I am still confused about the answer to [Q3]: what do you mean by 'the COLMAP result is provided for the first frame (about 20 images) out of a total of 300 frames'? Does it mean these 20 images are multi-view images at the same time?
Comment

Thank you for your thoughtful feedback; your comments are valuable to us. Here are the answers to your questions:

  1. We mean that the newly appeared object mentioned in Section 6 has "no relevant 3D Gaussians in neighboring frames". Here, "relevant 3D Gaussians" refers to 3D Gaussians that are close enough to be densified toward the object. These Gaussians are not only Gaussians corresponding to the same object, but also Gaussians of neighboring objects (in Figure D of the uploaded PDF, the neighboring Gaussians of the person could be the table and decorations). So it is true that FEDGS can handle newly appearing objects. However, it is still difficult to represent suddenly appearing objects with no neighboring objects. Although we wanted to show such an example, unfortunately, there is no scene where objects suddenly appear in the public datasets. Instead, as shown in Figure D of the uploaded PDF, we display the result on slowly appearing objects.
    In this rebuttal, we chose a video with a different time range for the additional experiment. The frame range includes newly appearing objects (Frames #50~#169). Therefore, the performance can differ between the additional experiments and the main paper (Frames #151~#200).

  2. The experiments in Table C are conducted on the Cook Spinach scene from the N3V dataset.

  3. First, for the static point extraction, the percentage does not determine the final ratio of dynamic points. In our model, pruning and densification are applied to both static and dynamic points, automatically reducing or increasing the number of both. Even if we convert more static points to dynamic points (as long as it is not excessive), static points will be split and dynamic points will be pruned to correct erroneous conversions, and vice versa. Therefore, the extraction percentile only triggers conversions during optimization; it is not a fixed value that determines the ratio of dynamic points. In fact, with the same extraction percentile, 56.4% of the points were finally considered dynamic on the Birthday scene from the Technicolor dataset, while 32.3% were dynamic on the Cook Spinach scene from the N3V dataset. Unlike the extraction percentile, we do not control the ratio of dynamic points; it depends on the dataset or scene.
    For more generality of the keyframe interval, as in SWinGS (ECCV 2024), mentioned by reviewer Vpue, adaptive keyframe interval selection based on the magnitude of optical flow between frames is one possible solution (see the illustrative sketch after this list). With this, dynamic keyframe allocation becomes possible, varying keyframe intervals even within the same scene. Overall, we agree with your concern on generality and will mention it as one of our future works in the revised version of this paper.

  4. Exactly. In the case of the N3V dataset, it is a multi-view video dataset with 300 frames captured by approximately 20 time-synchronized cameras (the number of cameras varies from scene to scene). We only use the 20 images taken in the first frame.
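
As a purely illustrative sketch of this possible future direction (it is not part of FEDGS and not the authors' implementation), adaptive keyframe placement could insert a new keyframe whenever the accumulated optical-flow magnitude since the previous keyframe exceeds a budget; the function name, budget, and cap below are hypothetical.

```python
# Hypothetical adaptive keyframe selection driven by optical-flow magnitude:
# fast motion yields dense keyframes, slow motion sparse ones. Not part of FEDGS.
import numpy as np

def select_keyframes(flow_mags, budget=50.0, max_gap=20):
    """flow_mags[i]: mean optical-flow magnitude (pixels) between frame i and i+1."""
    keyframes, acc = [0], 0.0
    for i, mag in enumerate(flow_mags, start=1):
        acc += mag
        if acc >= budget or i - keyframes[-1] >= max_gap:
            keyframes.append(i)      # start a new keyframe here
            acc = 0.0
    return keyframes

rng = np.random.default_rng(0)
mags = np.concatenate([rng.uniform(1, 3, 100), rng.uniform(10, 20, 50)])  # slow then fast motion
print(select_keyframes(mags))  # sparse indices first, then densely spaced ones
```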

We hope our answers will help you better understand our work. And if you have any other questions, don't hesitate to ask us.

Review (Rating: 6)

The paper proposes a new method in the field of novel view synthesis for video input. The authors propose a Gaussian Splatting-based algorithm that introduces a fully explicit representation at keyframes and models interpolation of Gaussians (position, rotation, opacity) between the frames. Additionally, the paper proposes to learn scene separation into static and dynamic parts as training progresses. The method is evaluated on datasets popular in the video novel view synthesis literature.

Strengths

  • The idea of modelling the scene explicitly in keyframes is certainly an interesting choice - it has the potential to offer a solution to modelling long-range dependencies. In combination with the use of Gaussian Splatting, it is definitely a novel concept.
  • The choice of interpolators seems to be good and well supported through experiments, specifically two-gaussian opacity modelling is a neat way of dealing with objects appearing and disappearing.
  • Separation into static and dynamic Gaussians is a good idea and based on the results and visualisations seems to work well in the proposed methodology. The premise of the majority of the scene being modelled as static was explored in NeRF-based solutions. However, this work provides a novel solution of how to implicitly divide the scene into dynamic and static gaussians.
  • The description of the method is easy to follow.
  • The authors provide a large number of ablation experiments with many components of the methods focused on.
  • The proposed method reports very convincing training times, especially for training on videos.
  • The supplementary video presents the qualitative results in a very clear manner.

Weaknesses

  • I would like to see more detailed descriptions of the approach throughout the paper. Mainly:
    • How does the temporal opacity modelling relate to the densification of Gaussians?
    • More details on keyframe selection would be useful - why fixed step, how is the step chosen.
    • What is the motivation of static gaussians having linear position interpolation?
  • It would be good to see the same detailed breakdown per scene for the Technicolor dataset. Similarly, the Technicolor experiment shows fewer comparisons to other methods (I would try to include more, at least Gaussian Splatting based ones).
  • The keyframe interpolation naturally draws attention to the method potentially being better at handling big movements or longer videos. Therefore, I think such a comparison is missing in this work. It could be done on the same datasets. Other works compare only short portions of the Technicolor dataset which hides the potential issues with longer videos and bigger movements, however, full scenes are available. Similarly, Neural 3D Video offers a very long sequence for the Flame Salmon scene.
  • Even though the ablation is extensive, there could be more insights on the results provided.
    • Firstly, we are missing the information on what is the ablation performed on - which dataset, one or more scenes.
    • Is w/o Dynamic point extraction treating everything as static, just with position linear interpolation?
    • It seems that all the components are highly relevant (removing any one degrades the quality by at least roughly 0.5 dB). Is that a consistent result between multiple runs? Also, is this scene-specific or an average?
  • It would be great if the authors provided some statistical analysis of the results.

Questions

  • HumanRF [1] pays particular attention to how to manage bigger and smaller movements in the video via non-uniform segmentation into manageable chunks - did you notice the high sensitivity of your method to keyframe interval?

  • Relating to mentioned more detailed descriptions:

    • It seems that only the initial set of gaussians is being modified throughout the training. Therefore, how does the appearance of a new object happen? For a big object, there have to be a lot of new gaussians. Also, all appearances in the video have to be modelled with two gaussians. Are there any issues when an object appears and disappears repetitively (e.g. the rotating decoration in the Technicolor Birthday scene)?
    • Point backtracking for pruning is not fully clear to me. To make sure, does D include all training views at all timesteps? Further, what is the pruning rule (is it carried over from the original Gaussian Splatting)?
    • Regarding linear position change for static gaussians - why is this necessary given the gaussians should be static? Is it to correct tiny movements? Is there no need to allow linear rotation interpolation as well?
  • I would encourage authors to check [2] - it seems that this method also models the gaussian representation explicitly in keyframes, however, takes a different approach for motion modelling. Note that this has only just been published at ECCV and does not affect my opinion on this paper, it may just be a worthwhile mention in related work of how the proposed method stands out with respect to others.

[1] Mustafa Işık, Martin Rünz, Markos Georgopoulos, Taras Khakhulin, Jonathan Starck, Lourdes Agapito, Matthias Nießner, HumanRF: High-Fidelity Neural Radiance Fields for Humans in Motion, ACM Transactions on Graphics (TOG), 2023

[2] Richard Shaw, Jifei Song, Arthur Moreau, Michal Nazarczuk, Sibi Catley-Chandar, Helisa Dhamo, Eduardo Perez-Pellitero, SWinGS: Sliding Windows for Dynamic 3D Gaussian Splatting, European Conference on Computer Vision, 2024

Limitations

The authors provided a brief limitations paragraph. I believe the issue with new object appearance is a good mention.

Comment

Table J. Experimental results on different keyframe intervals and skipped frames.

| Keyframe interval | Skipped frames | PSNR | SSIM | LPIPS | Size (MB) |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | 31.17 | 0.948 | 0.057 | 595 |
| 1 | 2 | 31.47 | 0.948 | 0.056 | 415 |
| 1 | 3 | 31.81 | 0.946 | 0.051 | 142 |
| 2 | 1 | 32.06 | 0.952 | 0.051 | 314 |
| 2 | 2 | 32.33 | 0.954 | 0.049 | 322 |
| 2 | 3 | 31.81 | 0.953 | 0.044 | 101 |
| 5 | 1 | 31.70 | 0.953 | 0.047 | 206 |
| 5 | 2 | 32.29 | 0.954 | 0.043 | 126 |
| 5 | 3 | 32.53 | 0.954 | 0.045 | 80 |
| 10 | 1 | 33.04 | 0.956 | 0.041 | 119 |
| 10 | 2 | 32.65 | 0.956 | 0.043 | 93 |
| 10 | 3 | 31.79 | 0.953 | 0.046 | 74 |
| 20 | 1 | 32.78 | 0.955 | 0.043 | 90 |
| 20 | 2 | 32.07 | 0.952 | 0.047 | 78 |
| 20 | 3 | 32.08 | 0.953 | 0.048 | 73 |
| 50 | 1 | 32.14 | 0.955 | 0.046 | 79 |
| 50 | 2 | 31.93 | 0.951 | 0.052 | 72 |
| 50 | 3 | 30.91 | 0.949 | 0.056 | 73 |

Table K. Quantitative comparisons on the Train scene in the Technicolor dataset.

| Technicolor Train (longer) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| STG | 32.17 | 0.940 | 0.035 |
| 4DGS | 29.11 | 0.877 | 0.119 |
| 4D Gaussians | 23.31 | 0.657 | 0.385 |
| Ours | 32.18 | 0.938 | 0.044 |
Comment

I'd like to thank the authors for their detailed response to my comments. With respect to the questions regarding explanations, I am now fully clear on what the authors did in their method (and I think this should be reflected in the final version of the manuscript). The breakdown per scene for Technicolor looks interesting as well (it invites some analysis/comments on why the authors believe their method improves in particular scenes, etc.). As for the long-scene experiment, it would be great to see at some point a comparison with other methods (I am not asking for this within the rebuttal; I just believe it can increase the value of the experimental section in the future).

Comment

We are pleased that our responses help to clarify the understanding of our work. Your insights and suggestions are valuable to us. We are also interested in experimenting with long durations and seeing the results with different hyperparameters/iterations. Additionally, we report experimental results on the long-duration video for the comparison methods below:

| Method | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| 4DGS | 26.73 | 0.899 | 0.112 |
| 4D Gaussians | 28.48 | 0.905 | 0.095 |
| Ours | 28.77 | 0.919 | 0.076 |

We will add the additional results and analysis to the revised version.

Comment

Table F. Statistical analysis (Mean ± STD) of ablation studies measured by PSNR.

| PSNR | Coffee Martini | Cook Spinach | Cut Roasted Beef | Flame Salmon | Flame Steak | Sear Steak | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| w/ Linear position | 28.04 ± 0.3174 | 32.26 ± 0.3547 | 33.02 ± 0.1712 | 28.93 ± 0.2574 | 32.92 ± 0.3782 | 33.46 ± 0.1081 | 31.44 ± 0.2645 |
| w/ Linear rotation | 28.25 ± 0.1379 | 32.21 ± 0.1264 | 32.78 ± 0.4129 | 28.68 ± 0.3950 | 33.08 ± 0.2415 | 33.49 ± 0.2382 | 31.41 ± 0.2587 |
| w/ Linear position & rotation | 28.17 ± 0.2974 | 32.48 ± 0.2138 | 32.61 ± 0.9545 | 28.99 ± 0.4750 | 32.92 ± 0.1230 | 33.27 ± 0.3726 | 31.41 ± 0.4061 |
| w/o Dynamic point extraction | 27.59 ± 0.3643 | 28.95 ± 1.0637 | 29.44 ± 1.3829 | 27.43 ± 0.2988 | 29.18 ± 1.4691 | 31.39 ± 0.9176 | 29.00 ± 0.9161 |
| w/o Temporal opacity | 28.18 ± 0.3947 | 32.23 ± 0.2807 | 32.18 ± 1.1642 | 28.81 ± 0.4199 | 32.84 ± 0.1935 | 33.58 ± 0.0855 | 31.30 ± 0.4231 |
| w/o Progressive growing | 28.26 ± 0.2780 | 32.36 ± 0.1803 | 32.92 ± 0.3817 | 29.49 ± 0.2209 | 32.70 ± 0.6805 | 32.99 ± 0.3908 | 31.45 ± 0.3554 |
| w/o Regularization | 28.06 ± 0.4647 | 32.39 ± 0.1820 | 32.70 ± 0.7968 | 28.72 ± 0.2011 | 33.10 ± 0.3345 | 33.37 ± 0.1111 | 31.39 ± 0.3484 |
| w/o Point backtracking | 28.02 ± 0.3477 | 32.38 ± 0.3267 | 32.93 ± 0.2917 | 28.63 ± 0.6187 | 32.99 ± 0.1653 | 33.48 ± 0.1177 | 31.40 ± 0.3113 |
| Ours | 28.43 ± 0.1691 | 32.77 ± 0.2060 | 33.31 ± 0.1972 | 29.11 ± 0.1202 | 33.41 ± 0.2379 | 33.23 ± 0.2406 | 31.71 ± 0.1952 |

Table G. Statistical analysis (Mean ± STD) of ablation studies measured by SSIM.

| SSIM | Coffee Martini | Cook Spinach | Cut Roasted Beef | Flame Salmon | Flame Steak | Sear Steak | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| w/ Linear position | 0.9180 ± 0.0019 | 0.9518 ± 0.0012 | 0.9549 ± 0.0008 | 0.9234 ± 0.0014 | 0.9621 ± 0.0010 | 0.9622 ± 0.0002 | 0.9454 ± 0.0011 |
| w/ Linear rotation | 0.9178 ± 0.0010 | 0.9507 ± 0.0005 | 0.9550 ± 0.0008 | 0.9213 ± 0.0016 | 0.9621 ± 0.0005 | 0.9624 ± 0.0008 | 0.9449 ± 0.0009 |
| w/ Linear position & rotation | 0.9174 ± 0.0020 | 0.9526 ± 0.0010 | 0.9551 ± 0.0005 | 0.9233 ± 0.0017 | 0.9620 ± 0.0006 | 0.9616 ± 0.0006 | 0.9453 ± 0.0011 |
| w/o Dynamic point extraction | 0.9102 ± 0.0014 | 0.9345 ± 0.0048 | 0.9402 ± 0.0047 | 0.9166 ± 0.0015 | 0.9501 ± 0.0039 | 0.9519 ± 0.0026 | 0.9339 ± 0.0032 |
| w/o Temporal opacity | 0.9164 ± 0.0016 | 0.9505 ± 0.0013 | 0.9530 ± 0.0004 | 0.9213 ± 0.0024 | 0.9612 ± 0.0006 | 0.9622 ± 0.0004 | 0.9441 ± 0.0011 |
| w/o Progressive growing | 0.9166 ± 0.0025 | 0.9503 ± 0.0015 | 0.9556 ± 0.0008 | 0.9266 ± 0.0010 | 0.9611 ± 0.0018 | 0.9613 ± 0.0008 | 0.9453 ± 0.0014 |
| w/o Regularization | 0.9179 ± 0.0013 | 0.9518 ± 0.0006 | 0.9545 ± 0.0017 | 0.9231 ± 0.0008 | 0.9628 ± 0.0011 | 0.9625 ± 0.0004 | 0.9454 ± 0.0010 |
| w/o Point backtracking | 0.9176 ± 0.0008 | 0.9521 ± 0.0011 | 0.9547 ± 0.0015 | 0.9208 ± 0.0036 | 0.9622 ± 0.0007 | 0.9624 ± 0.0005 | 0.9450 ± 0.0014 |
| Ours | 0.9154 ± 0.0013 | 0.9562 ± 0.0004 | 0.9572 ± 0.0004 | 0.9253 ± 0.0005 | 0.9621 ± 0.0007 | 0.9603 ± 0.0009 | 0.9461 ± 0.0007 |

Table H. Statistical analysis (Mean ± STD) of ablation studies measured by LPIPS.

| LPIPS | Coffee Martini | Cook Spinach | Cut Roasted Beef | Flame Salmon | Flame Steak | Sear Steak | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| w/ Linear position | 0.0723 ± 0.0011 | 0.0490 ± 0.0009 | 0.0428 ± 0.0003 | 0.0759 ± 0.0011 | 0.0350 ± 0.0006 | 0.0365 ± 0.0003 | 0.0519 ± 0.0007 |
| w/ Linear rotation | 0.0723 ± 0.0014 | 0.0495 ± 0.0008 | 0.0441 ± 0.0011 | 0.0777 ± 0.0019 | 0.0351 ± 0.0002 | 0.0363 ± 0.0006 | 0.0525 ± 0.0010 |
| w/ Linear position & rotation | 0.0720 ± 0.0029 | 0.0486 ± 0.0003 | 0.0437 ± 0.0011 | 0.0762 ± 0.0012 | 0.0354 ± 0.0002 | 0.0372 ± 0.0007 | 0.0522 ± 0.0011 |
| w/o Dynamic point extraction | 0.0786 ± 0.0010 | 0.0749 ± 0.0051 | 0.0716 ± 0.0096 | 0.0836 ± 0.0014 | 0.0642 ± 0.0065 | 0.0617 ± 0.0070 | 0.0724 ± 0.0051 |
| w/o Temporal opacity | 0.0725 ± 0.0023 | 0.0490 ± 0.0004 | 0.0452 ± 0.0010 | 0.0772 ± 0.0026 | 0.0354 ± 0.0007 | 0.0368 ± 0.0007 | 0.0527 ± 0.0013 |
| w/o Progressive growing | 0.0699 ± 0.0021 | 0.0511 ± 0.0007 | 0.0443 ± 0.0005 | 0.0734 ± 0.0008 | 0.0384 ± 0.0036 | 0.0403 ± 0.0034 | 0.0529 ± 0.0019 |
| w/o Regularization | 0.0713 ± 0.0030 | 0.0495 ± 0.0008 | 0.0437 ± 0.0012 | 0.0756 ± 0.0016 | 0.0343 ± 0.0004 | 0.0369 ± 0.0006 | 0.0519 ± 0.0013 |
| w/o Point backtracking | 0.0716 ± 0.0019 | 0.0491 ± 0.0003 | 0.0452 ± 0.0038 | 0.0777 ± 0.0027 | 0.0350 ± 0.0007 | 0.0364 ± 0.0004 | 0.0525 ± 0.0016 |
| Ours | 0.0721 ± 0.0019 | 0.0417 ± 0.0004 | 0.0416 ± 0.0005 | 0.0659 ± 0.0005 | 0.0346 ± 0.0010 | 0.0357 ± 0.0012 | 0.0486 ± 0.0009 |

Table I. Statistical analysis (Mean ± STD) on the N3V dataset.

| Scene | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| Coffee Martini | 28.43 ± 0.169 | 0.915 ± 0.0013 | 0.0720 ± 0.0009 |
| Cook Spinach | 32.77 ± 0.206 | 0.956 ± 0.0004 | 0.0420 ± 0.0004 |
| Cut Roasted Beef | 33.31 ± 0.197 | 0.957 ± 0.0004 | 0.0416 ± 0.0005 |
| Flame Salmon | 29.11 ± 0.120 | 0.925 ± 0.0005 | 0.0659 ± 0.0005 |
| Flame Steak | 33.41 ± 0.238 | 0.962 ± 0.0007 | 0.0346 ± 0.0010 |
| Sear Steak | 33.23 ± 0.241 | 0.960 ± 0.0009 | 0.0357 ± 0.0012 |
| Average | 31.71 ± 0.195 | 0.946 ± 0.0007 | 0.0490 ± 0.0009 |
Comment

Table A. Experimental results on different keyframe intervals.

| Keyframe interval | PSNR | SSIM | LPIPS | Size (MB) |
| --- | --- | --- | --- | --- |
| 1 | 31.17 | 0.948 | 0.057 | 595 |
| 2 | 32.06 | 0.952 | 0.051 | 314 |
| 5 | 31.70 | 0.953 | 0.047 | 206 |
| 10 | 33.04 | 0.956 | 0.041 | 119 |
| 20 | 32.78 | 0.955 | 0.043 | 90 |
| 50 | 32.14 | 0.955 | 0.046 | 79 |

Table B. Quantitative evaluation on the Technicolor dataset using PSNR.

| PSNR | Birthday | Fabien | Painter | Theater | Train | Average |
| --- | --- | --- | --- | --- | --- | --- |
| DyNeRF | 29.20 | 32.76 | 35.95 | 29.53 | 31.58 | 31.80 |
| HyperReel | 29.99 | 34.70 | 35.91 | 33.32 | 29.74 | 32.73 |
| STG (Sparse) | 31.97 | 34.54 | 36.50 | 30.55 | 32.66 | 33.25 |
| 4DGS | 28.04 | 26.22 | 33.80 | 31.50 | 27.93 | 29.50 |
| 4D Gaussians | 30.87 | 33.56 | 34.36 | 29.81 | 25.35 | 30.79 |
| Ours | 32.55 | 34.54 | 36.56 | 31.11 | 31.72 | 33.30 |

Table C. Quantitative evaluation on the Technicolor dataset using SSIM.

| SSIM | Birthday | Fabien | Painter | Theater | Train | Average |
| --- | --- | --- | --- | --- | --- | --- |
| DyNeRF† | 0.952 | 0.965 | 0.972 | 0.939 | 0.962 | 0.958 |
| HyperReel | 0.922 | 0.895 | 0.923 | 0.895 | 0.895 | 0.906 |
| STG (Sparse) | 0.942 | 0.886 | 0.925 | 0.877 | 0.942 | 0.915 |
| 4DGS | 0.905 | 0.867 | 0.900 | 0.874 | 0.839 | 0.877 |
| 4D Gaussians | 0.906 | 0.867 | 0.886 | 0.848 | 0.729 | 0.847 |
| Ours | 0.943 | 0.884 | 0.929 | 0.875 | 0.921 | 0.910 |

†: Uses the structural similarity function from the scikit-image library.

Table D. Quantitative evaluation on the Technicolor dataset using LPIPS.

| LPIPS | Birthday | Fabien | Painter | Theater | Train | Average |
| --- | --- | --- | --- | --- | --- | --- |
| DyNeRF | 0.067 | 0.242 | 0.146 | 0.188 | 0.067 | 0.142 |
| HyperReel | 0.053 | 0.186 | 0.117 | 0.115 | 0.072 | 0.109 |
| STG (Sparse) | 0.039 | 0.135 | 0.098 | 0.122 | 0.033 | 0.086 |
| 4DGS | 0.089 | 0.199 | 0.138 | 0.157 | 0.166 | 0.150 |
| 4D Gaussians | 0.088 | 0.188 | 0.162 | 0.189 | 0.272 | 0.180 |
| Ours | 0.043 | 0.152 | 0.089 | 0.143 | 0.071 | 0.100 |

Table E. Quantitative results on the extremely long duration video sequence (Flame Salmon) in the N3V dataset.

| | PSNR | SSIM | LPIPS | Size (MB) |
| --- | --- | --- | --- | --- |
| Ours | 28.77 | 0.919 | 0.076 | 392 |
Comment

[Q3] Thanks for introducing a relevant paper. We have looked at this paper and will state the differences between it and this paper.

SWinGS disentangles static and dynamic scenes using an MLP. For dynamic regions, the model slices the timestamps using a sliding-window strategy and trains a deformation MLP of the canonical space per window. SWinGS is similar to FEDGS in that it divides static and dynamic parts and slices timestamps, but the differences are as follows:

  • Detecting the dynamic part: SWinGS uses an MLP to classify static and dynamic in a binary manner. It performs classification by thresholding the L1 error and uses spatial location as input. FEDGS, on the other hand, determines dynamic points based on the magnitude by which a static Gaussian moves. This can distinguish dynamic points even if two points are at the same location.
  • Dynamic representation: SWinGS uses an implicit function defined for each window to find the deformation in the canonical space of the dynamic grid. In contrast, we store the dynamic motion of each point independently in an explicit way. Therefore, we do not need the coordinates of the canonical space; we only need the position information of the neighboring keyframes.
  • Keyframe selection: SWinGS uses the magnitude of the optical flow to measure the magnitude of the motion and divides the keyframes based on this. FEDGS, on the other hand, uses fixed keyframes to take advantage of indexing, as described earlier. This interval is a hyperparameter.
  • Optimization: SWinGS adds a pixel-difference loss to achieve temporal consistency. FEDGS adds point backtracking for dynamic point pruning, conversion of static points to dynamic points, and a progressive learning scheme.
Comment

[Q1] The sensitivity to motion depends on the keyframe interval. However, instead of directly measuring motion, we decided to determine the keyframe interval based on rendering quality. As explained in W1-2, the keyframe interval is chosen based on both storage usage and rendering quality. Here, we assume that a sufficiently small keyframe interval (in our case, 10) can handle most complex motion. We report the experimental evidence in Table J. To simulate and measure motion speed, we intentionally skip frames in the videos.

[Q2] Each answer is as follows:

  1. As described in W1-1, we adopt the densification algorithm of the original 3DGS model, and this densification is applied to both static and dynamic points. In addition, static points are periodically converted to dynamic points, so that even a new object can be treated as dynamic points by splitting the Gaussians around it. In our experiments, the initial Gaussians start at about 7,000, but by the end of training the scene usually includes more than 200,000 Gaussians. Although densification can handle new objects as long as there are Gaussians around the dynamic object that can be split (e.g., static Gaussians can be converted to dynamic after splitting, or dynamic Gaussians can be split and optimized), when an object appears suddenly, as described in Section 6, it is difficult to learn because there are no Gaussians around it to split. This is the same problem as in the original 3DGS, which does not learn well unless there are Gaussians nearby to split (e.g., with random initialization). Also, for objects that appear repeatedly, there is no way to determine whether the disappearing object and the reappearing object are the same, because temporal consistency would be required to determine this. Our model represents them as separate objects when an object is temporally discontinuous. In our experiment with the rotating decoration that is occluded and reappears, our model learns that the old dynamic points disappear and new dynamic points appear, as shown in Figure C of the uploaded PDF file. In this figure, you can see that the number of points changes before and after the object flips, which shows that FEDGS creates different Gaussians when the object reappears. Nevertheless, the rendering results look good. We further carry out an experiment in the presence of occlusion and add a comparison evaluation in Table K. Figure A of the uploaded PDF file shows the comparison results when occlusion is present. We select 100 frames of the Train scene in the Technicolor dataset and compare FEDGS with the other models. We use the first frame as input and do not use the point cloud information of other frames. The results show that FEDGS renders dynamic objects well, even when they disappear and reappear. 4DGS learns dynamic parts well but struggles to render static parts, while STG struggles to render dynamic parts. 4D Gaussians fails to render dynamic objects.
  2. D represents the images of all timestamps in all training sets. The original 3DGS does not have this mechanism; we add it because the opacity-based pruning of the original 3DGS does not completely remove floaters caused by dynamic objects, as some dynamic objects are trained to move to an invisible area instead of vanishing. Our method measures how much each Gaussian contributes to the error in the rendered image, much like accumulating a gradient in densification. More specifically, using this operation on a rendered image, we obtain the sum of the error (L1 or SSIM error) and the sum of the alpha values of all the Gaussians in the image. Dividing this cumulative error by the sum of the alpha values normalizes it, and we threshold the normalized error to remove Gaussians that cause large errors (a schematic sketch is given after this list).
  3. The reason why we assume that static points have linear motion is to distinguish dynamic points, as explained in W1-3. This does not aim to model the motion of static parts exactly. We agree with your idea of modeling the rotation of static points, but in our method the magnitude of the position change is enough to distinguish dynamic points. Different from the original 3DGS model, which learns positional information of static points, all static points in FEDGS are approximated to have linear movement, and we measure the magnitude of the motion between frames to reassign a static point as a dynamic point if the magnitude exceeds a predefined threshold. Interestingly, although static points are represented with linear motion over time, they can capture all temporal changes, including color, rotation, and opacity, because a static point is considered a moving point during optimization if any of these temporal changes exists. Therefore, instead of using a more complex model to represent static points, we use a simple yet effective approach that further reduces the computational complexity.
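
To make the description in point 2 above concrete, here is a schematic sketch of the error back-tracking pruning. It assumes the rasterizer can attribute per-pixel error and alpha contributions to each Gaussian; the buffer names and the threshold are illustrative and do not reproduce the paper's code.

```python
# Schematic sketch of point-backtracking pruning: accumulate, per Gaussian, the
# rendering error it contributes (weighted by its alpha) over all training views and
# timestamps in D, normalize by its accumulated alpha, and prune the outliers.
import torch

def backtrack_prune(err_sum: torch.Tensor, alpha_sum: torch.Tensor, thresh: float = 0.1):
    """err_sum[i]: alpha-weighted L1/SSIM error accumulated for Gaussian i over D.
    alpha_sum[i]: accumulated alpha contribution of Gaussian i over the same pixels."""
    norm_err = err_sum / alpha_sum.clamp(min=1e-8)   # normalized per-Gaussian error
    return norm_err < thresh                         # keep mask; False means prune

# Usage with dummy accumulators for 100k Gaussians:
err_sum = torch.rand(100_000)
alpha_sum = 10 * torch.rand(100_000)
keep = backtrack_prune(err_sum, alpha_sum)
print(f"kept {keep.sum().item()} / {keep.numel()} Gaussians")
```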

The answer to [Q3] follows in the next comment.

Author Response

[W1] Each answer is as follows. The revised version will reflect this description.

  1. FEDGS follows the densification algorithm of the original 3DGS. This algorithm accumulates the gradient magnitudes of visible Gaussians (that is, the gradient in the x/y directions of image space) for a camera during training, selects Gaussians whose gradient magnitudes are greater than a threshold, and applies splitting. Temporal opacity is not directly involved in densification, but it scales the 2D gradient according to visibility over time, so a dynamic Gaussian accumulates gradients only at timestamps in which it is visible.

  2. The reason why we fix the keyframe interval is an optimization issue. While a varying keyframe interval has advantages in terms of storage (e.g., allocating wider intervals for slow motion and narrower intervals for fast motion), it imposes a high cost on indexing, since the keyframe index of each Gaussian must be searched for every timestamp. With fixed timestamps, this cost can be greatly reduced by simple indexing (we can chunk the keyframe data as a vector). We choose fixed-step keyframes because we believe that the benefit of reducing the computational cost outweighs the storage wasted by fixed timestamps, and in practice the resulting storage size remains reasonable. The keyframe interval is chosen empirically, considering rendering quality and efficiency: shorter intervals can handle more complex motion but are prone to overfitting and require more storage, while large intervals are often inadequate for complex motion. We add test results for a variety of intervals on the N3V dataset in Table A.

  3. The reason why we assume that static Gaussians have linear motion is to distinguish dynamic points from the remaining static points. We observe a high correlation between spatial movement and dynamicity when training a scene with dynamic motion. We find that when we apply only a linear transformation (i.e., Equation 5) to the 3DGS model, the points with the largest transformation closely coincide with the dynamic part (e.g., selecting the top 2% of points in the N3V dataset by motion magnitude closely matches the points corresponding to the person). Of course, we could adopt more sophisticated motion assumptions, but the linear motion assumption is enough to account for it. We note that the linear motion is normalized by the distance between each Gaussian and the camera, which imposes a higher weight on an object close to the camera than on a distant one for the same amount of movement (a schematic sketch of this extraction follows below).
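
The extraction step in point 3 can be sketched as follows: score each static Gaussian by its distance-normalized linear displacement and convert the top fraction (e.g. 2%) to dynamic points. The tensor names and the exact normalization are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of dynamic-point extraction: rank static Gaussians by the magnitude
# of their fitted linear motion, normalized by distance to the camera, and convert
# the top fraction to dynamic points. Illustrative only.
import torch

def extract_dynamic(velocity, positions, cam_center, duration, top_frac=0.02):
    """velocity: (N, 3) fitted linear motion per static Gaussian; positions: (N, 3)."""
    displacement = velocity.norm(dim=-1) * duration             # total linear movement
    dist_to_cam = (positions - cam_center).norm(dim=-1)
    score = displacement / dist_to_cam.clamp(min=1e-6)          # weight near-camera motion more
    k = max(1, int(top_frac * score.numel()))
    return torch.topk(score, k).indices                         # indices to convert to dynamic

idx = extract_dynamic(velocity=0.01 * torch.randn(100_000, 3),
                      positions=torch.randn(100_000, 3),
                      cam_center=torch.zeros(3), duration=10.0)
print(f"converting {idx.numel()} static Gaussians to dynamic points")
```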

[W2] Thanks for pointing this out. We have added Table B, Table C and Table D with per-scene breakdowns of PSNR, SSIM and LPIPS, respectively, for the Technicolor dataset. The reason there are no results for some other methods is implementation issues, because the Technicolor dataset is not officially supported by them. Nevertheless, we add as many baselines as we can.

[W3] We test our model on a 1000-frame scene from the N3V dataset as recommended. We have added a quantitative evaluation in Table E, and the rendered images can be found in Figure E of the uploaded PDF file. This result shows reasonable performance and memory usage.

[W4] The answers for each item are shown below. We will clarify them in the revised version of this paper.

  1. We report the average results over all scenes of the N3V dataset in the ablation study.
  2. In the w/o dynamic point extraction setting, the linear motion transformation in Equation 5 is applied to the static points.
  3. As described in W4-1, our ablation reports the average over all scenes in the N3V dataset. We have added Table F, Table G and Table H with per-scene ablation breakdowns of PSNR, SSIM and LPIPS, respectively, for the N3V dataset. We ran each ablation 5 times to obtain multi-run statistics, reported below. Compared to the other cases, ours shows the highest mean PSNR and the lowest STD values. Please check them, and let us know if you have more questions.

[W5] We report multi-run experiments on all scenes of the N3V dataset in Table I, which corresponds to the "Ours" rows of Table F, Table G and Table H. We run our method 10 times per scene and report the mean and STD for each. Our method shows lower STD values than the other cases, which suggests that each component of FEDGS contributes to stable performance. Please refer to Table I for detailed results.

The answers to [Q1] and [Q2] follow in the next comment.

Review (Rating: 5)

This paper models dynamic scenes using 3DGS, unlike other methods that model dynamic scenes with both implicit and explicit representations. It proposes Fully Explicit Dynamic Gaussian Splatting (FEDGS), a method that models 4D scenes using a purely explicit approach. FEDGS employs a Cubic Hermite Interpolator to predict positions at different times and Spherical Linear Interpolation to predict rotations. FEDGS can fit dynamic scenes with sparse initialization on the N3V dataset without any auxiliary modules or tricks to encode lengthy temporal information.

Strengths

  1. Modeling 4D scenes with a purely explicit method is rare, and FEDGS can render at 120fps under 157MB, and the training speed is fast.
  2. FEDGS, as a purely explicit method, achieves better results in the case of sparse COLMAP point clouds. Many methods based on 3DGS for dynamic scene reconstruction face challenges with sparse initial point clouds.

Weaknesses

  1. In the supplementary video, compared to other methods, many positions that should be static are moving, causing flickering in objects like hats and tables, and even some Gaussian spheres can be seen rotating. Is this a drawback of the method described in section 4.2.2? This is completely different from the results of previous methods like NeRFPlayer [69] that separate dynamic and static elements, and this part lacks further explanation and clarification.
  2. The most attractive aspect of FEDGS is its dynamic scene reconstruction results under sparse COLMAP point cloud conditions, which may be an advantage of the purely explicit approach. However, the paper does not clearly explain this, and providing more experiments could highlight the advantages of the purely explicit method, enhancing the paper's contribution.
  3. Compared to 4DGaussians, the rendering speed is slower. Is this due to slower predictions of positions and rotations for 3DGS? However, 4DGaussians use hexplane predictions, which should have a higher computational cost than formulas 6 and 9.

Questions

Why the purely explicit approach can bring improvements under sparse COLMAP point cloud conditions is not clearly explained in the paper. I would be pleased to see the author address the issues mentioned in the weakness section, improving the quality and contribution of the paper.

Limitations

After separating dynamic and static Gaussians, the rendering result of the static parts may experience jitter, which is the main limitation, and it's worth more exploration here.

Author Response

[W1] We thank you for your careful comment about the static points. In practice, we observe temporal color and luminance changes even when positions do not change. Our method is designed to handle all such temporal changes with dynamic points: not only the movement of objects, but also changes over time such as appearances, disappearances, and color changes. In some cases, such as the hats and tables in our supplementary video, there are temporal shadows or changes in luminance at some timestamps; in these cases, dynamic points account for the light or rotation changes as well, thanks to spherical harmonics. To validate this, we conduct an experiment in which stationary objects are forced to be completely static, shown in Figure B of the uploaded PDF file. We train 3DGS on the Coffee Martini scene with a mask on the dynamic objects (a person and a glass) and report test-set results when 3DGS is used alone and when 3DGS has its dynamic parts replaced with our model. The numerical results are shown in Table A. As shown, treating such points as completely static, even when they do not change position, leads to poor rendering quality because the static points cannot handle other temporal information such as color changes or shadows. Therefore, it is beneficial to treat them as dynamic points.

[W2, Q1] The reason we have an advantage in the sparse COLMAP condition is that the explicit representation makes the densification algorithm of 3DGS work better. Since the sparse COLMAP condition starts from fewer points, the empty space needs to be filled with new Gaussians, which are split from existing Gaussian points. In 3DGS, the splitting algorithm operates on points whose accumulated image-space (x/y) gradients exceed a threshold; this gradient must be large enough for a Gaussian to be split. However, when an implicit function is used, the deformation of the implicit function is optimized first, rather than the gradient of the Gaussian. This interferes with Gaussian splitting, so implicit models require a more accurate 3D prior.
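
For reference, the gradient-driven densification logic referred to here can be sketched as follows, in the spirit of the original 3DGS: accumulate screen-space positional gradients per Gaussian and clone/split the ones whose average gradient exceeds a threshold. The buffer names and the threshold value are illustrative and do not reproduce the official implementation.

```python
# Sketch of gradient-threshold densification in the spirit of 3DGS. The accumulated
# screen-space gradient is the signal that lets an explicit representation grow new
# Gaussians from a sparse COLMAP initialization. Illustrative only.
import torch

class DensifyState:
    def __init__(self, num_gaussians: int):
        self.grad_accum = torch.zeros(num_gaussians)   # summed ||d(loss)/d(mean_2d)||
        self.denom = torch.zeros(num_gaussians)        # how often each Gaussian was visible

    def accumulate(self, mean2d_grad: torch.Tensor, visible: torch.Tensor):
        self.grad_accum[visible] += mean2d_grad[visible].norm(dim=-1)
        self.denom[visible] += 1

    def candidates(self, grad_thresh: float = 2e-4) -> torch.Tensor:
        avg = self.grad_accum / self.denom.clamp(min=1)
        return avg > grad_thresh                       # these Gaussians get cloned/split

state = DensifyState(num_gaussians=50_000)
state.accumulate(mean2d_grad=torch.randn(50_000, 2), visible=torch.rand(50_000) > 0.5)
print(state.candidates().sum().item(), "Gaussians flagged for densification")
```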

We perform several experiments in this regard. The following experiments show results for scenes that are difficult to train. In these experiments, the ability to reconstruct dynamic scenes from sparse COLMAPs implies that these scenes can also be handled by FEDGS.

First, we select 100 frames including the occlusion of dynamic objects in the Train scene of the Technicolor dataset and compare FEDGS with the others. We use the point cloud prior of the first frame, so no information is given about the object that reappears after being occluded. The results are reported in Table B and an example is displayed in Figure A of the uploaded PDF. In these results, the explicit models STG, 4DGS, and ours perform significantly better. STG does not learn the dynamic part well in later frames; 4DGS renders the dynamic part reasonably but has difficulty handling the static part, which negatively impacts the overall performance. In particular, 4DGaussians, an implicit model, fails to disentangle the static and dynamic parts, resulting in missing renderings of the dynamic part. Our model, on the other hand, shows good performance and learns both static and dynamic parts well.

Next, we carry out an experiment to see whether FEDGS can learn newly appearing objects. To handle new objects well, the dynamic part must be split well. To test this, we select 120 frames from the Technicolor Birthday scene in which a person appears from nowhere. We use the point cloud prior from a frame where the person is invisible. The numerical results are shown in Table C and the rendered images in Figure D of the uploaded PDF file. This result shows that the FEDGS model is beneficial for splitting Gaussians in dynamic scenes and is able to handle newly appearing dynamic objects.

Finally, we conduct an experiment with longer frames (1000 frames, 20,000 images in total) of the flame salmon scene from the N3V dataset. The result is reported in Table D and the rendered image in Figure E of the uploaded PDF file. This result shows that our model can learn well with reasonable storage, even for extremely long videos.

[W3] This is because 4DGaussians can be trained with fewer Gaussians on the N3V dataset than on complex scenes such as the Technicolor dataset. However, 4DGaussians suffers from a huge FPS drop when the minimum number of Gaussians needed is large (i.e. complex scene configurations). To prove it, we conduct an additional experiment on a more complicated scene, Technicolor, and the result is in Table E.

Comment

Thanks for the authors' response!

I have carefully read the rebuttal, and most of my concerns have been addressed. Although Table E shows good rendering speed, it does not show the storage cost, which is 392MB in the complex scene shown in Table D. There is still a concern about the balance between speed and storage in the explicit method proposed in this paper.

Comment

Thank you for your comment noting that our rebuttal addresses almost all of your concerns. Here is the answer to your additional question:

We report the results on both the Technicolor dataset and the N3V dataset. As shown in the tables below, the 4D Gaussians model has a larger variation in FPS, despite having a similar storage size in both datasets. This validates that rendering speed is more dependent on the scene configuration than on storage size. Our method, on the other hand, shows consistent performance regardless of scenes.

We note that, regarding the storage size, around 300 MB should be manageable for most devices these days (note that the public datasets we use are on the order of gigabytes). As another aspect, we measure the shared GPU memory of ours and 4D Gaussians on a machine with one NVIDIA 2080 Ti. The difference between them is marginal, which means both models can run on the same GPU machine.

Nevertheless, we recognize that our model takes more storage than 4D Gaussians. We will explore incorporating compression techniques [1, 2] into our model to reduce the storage size as one of our future works. We expect this to be feasible because our model preemptively separates the static part of a scene, which is easy to compress.

Table F. Averaged results on the Technicolor dataset.

| Method | PSNR | SSIM | LPIPS | FPS | Size (MB) | GPU memory (GB) |
| --- | --- | --- | --- | --- | --- | --- |
| 4D Gaussians | 30.79 | 0.847 | 0.180 | 45 | 44 | 1.3 |
| Ours | 33.30 | 0.910 | 0.100 | 62 | 307 | 1.6 |

Table G. Averaged results on the N3V dataset.

| Method | PSNR | SSIM | LPIPS | FPS | Size (MB) | GPU memory (GB) |
| --- | --- | --- | --- | --- | --- | --- |
| 4D Gaussians | 26.69 | 0.923 | 0.074 | 147 | 34 | 1.4 |
| Ours | 29.04 | 0.940 | 0.052 | 121 | 157 | 1.1 |

[1] Simon Niedermayr et al., Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis, Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[2] Joo Chan Lee et al., Compact 3D Gaussian Representation for Radiance Field, Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

Comment

Table A. Quantitative results of the experiment on handling color changes without dynamic points.

| Model | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| 3DGS | 21.69 | 0.851 | 0.126 |
| 3DGS + ours dynamic | 26.07 | 0.891 | 0.089 |
| Ours | 29.03 | 0.922 | 0.068 |

Table B. Quantitative comparison on the Train scene in the Technicolor dataset.

| Technicolor Train (longer) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| STG | 32.17 | 0.940 | 0.035 |
| 4DGS | 29.11 | 0.877 | 0.119 |
| 4D Gaussians | 23.31 | 0.657 | 0.385 |
| Ours | 32.18 | 0.938 | 0.044 |

Table C. Quantitative results on the Birthday scene in the Technicolor dataset.

| Technicolor Birthday (longer) | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| Ours | 29.12 | 0.900 | 0.094 |

Table D. Quantitative results on the extremely long duration video sequence (Flame Salmon) in the N3V dataset.

| Flame Salmon (longer) | PSNR | SSIM | LPIPS | Size (MB) |
| --- | --- | --- | --- | --- |
| Ours | 28.77 | 0.919 | 0.076 | 392 |

Table E. Quantitative evaluation on the Technicolor dataset.

| Method | PSNR | SSIM | LPIPS | FPS |
| --- | --- | --- | --- | --- |
| 4D Gaussians | 30.79 | 0.847 | 0.180 | 45 |
| Ours | 33.30 | 0.910 | 0.100 | 62 |
Review (Rating: 5)

The authors propose a fully explicit dynamic Gaussian splatting method, based on keyframe interpolation. The authors separate a dynamic scene into static Gaussians and dynamic Gaussians during training and apply interpolation techniques under temporal explicit representation, including a polynomial basis interpolator for position, a spherical interpolator for rotation, and a simplified Gaussian mixture model for opacity. Additionally, the authors introduce a progressive training scheme and a point-backtracking technique to improve the final convergence. The proposed method was validated on the Neural 3D Video dataset and Technicolor dataset, and exceeds the baselines.

Strengths

  1. The paper is well-written and easy to understand.
  2. The proposed designs are well-validated. The authors have done a lot of ablation experiments to validate their designs.
  3. As a fully explicit method, the paper shows impressive results in terms of rendering quality, model size, and rendering efficiency.

Weaknesses

  1. Some important baselines are missing, including [1], [2], and [3]. All of them had released their code before the NeurIPS deadline, and they have shown better quality than most baselines selected by this paper.

  2. Some metrics are missing. I am not sure why in Table 1, the authors only report PSNR. The average LPIPS and SSIM comparison should be included.

  3. The quality improvement is minor compared with some baselines. According to Table 1, the average PSNR is just about 0.2dB higher than some baselines.

[1] 3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos

[2] 4K4D: Real-Time 4D View Synthesis at 4K Resolution

[3] Im4D: High-Fidelity and Real-Time Novel View Synthesis for Dynamic Scenes

Questions

  1. I am curious about the comparison results between this method and the baselines I mentioned in the weaknesses section.

  2. I am curious about the choice of keyframe interval. According to L259, the time interval is set to 10. I am curious why the authors chose 10 as the final hyperparameter. Will a different interval, like 5 or 20, affect the final quality?

Limitations

The limitations have been well discussed by the authors.

Comment

Table E. Experimental results on different keyframe intervals and skipped frames.

| Keyframe interval | Skipped frames | PSNR | SSIM | LPIPS | Size (MB) |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | 31.17 | 0.948 | 0.057 | 595 |
| 1 | 2 | 31.47 | 0.948 | 0.056 | 415 |
| 1 | 3 | 31.81 | 0.946 | 0.051 | 142 |
| 2 | 1 | 32.06 | 0.952 | 0.051 | 314 |
| 2 | 2 | 32.33 | 0.954 | 0.049 | 322 |
| 2 | 3 | 31.81 | 0.953 | 0.044 | 101 |
| 5 | 1 | 31.70 | 0.953 | 0.047 | 206 |
| 5 | 2 | 32.29 | 0.954 | 0.043 | 126 |
| 5 | 3 | 32.53 | 0.954 | 0.045 | 80 |
| 10 | 1 | 33.04 | 0.956 | 0.041 | 119 |
| 10 | 2 | 32.65 | 0.956 | 0.043 | 93 |
| 10 | 3 | 31.79 | 0.953 | 0.046 | 74 |
| 20 | 1 | 32.78 | 0.955 | 0.043 | 90 |
| 20 | 2 | 32.07 | 0.952 | 0.047 | 78 |
| 20 | 3 | 32.08 | 0.953 | 0.048 | 73 |
| 50 | 1 | 32.14 | 0.955 | 0.046 | 79 |
| 50 | 2 | 31.93 | 0.951 | 0.052 | 72 |
| 50 | 3 | 30.91 | 0.949 | 0.056 | 73 |
Comment

Table A. Quantitative evaluation on the N3V dataset using PSNR.

| PSNR | Coffee Martini | Cook Spinach | Cut Roasted Beef | Flame Salmon | Flame Steak | Sear Steak | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NeRFPlayer | 31.53 | 30.56 | 29.35 | 31.65 | 31.93 | 29.13 | 30.69 |
| HyperReel | 28.37 | 32.30 | 32.92 | 28.26 | 32.20 | 32.57 | 31.10 |
| Neural Volumes | N/A | N/A | N/A | 22.80 | N/A | N/A | 22.80 |
| LLFF | N/A | N/A | N/A | 23.24 | N/A | N/A | 23.24 |
| DyNeRF | N/A | N/A | N/A | 29.58 | N/A | N/A | 29.58 |
| HexPlane | N/A | 32.04 | 32.55 | 29.47 | 32.08 | 32.39 | 31.71 |
| K-Planes | 29.99 | 32.60 | 31.82 | 30.44 | 32.38 | 32.52 | 31.63 |
| MixVoxels-L | 29.63 | 32.25 | 32.40 | 29.80 | 31.83 | 32.10 | 31.34 |
| MixVoxels-X | 30.39 | 32.31 | 32.63 | 30.60 | 32.10 | 32.33 | 31.73 |
| Im4D | N/A | N/A | 32.58 | N/A | N/A | N/A | 32.58 |
| 4K4D | N/A | N/A | 32.86 | N/A | N/A | N/A | 32.86 |
| Dense COLMAP point cloud input | | | | | | | |
| STG | 28.41 | 32.62 | 32.53 | 28.61 | 33.30 | 33.40 | 31.48 |
| 4DGS | 28.33 | 32.93 | 33.85 | 29.38 | 34.03 | 33.51 | 32.01 |
| 4DGaussians | 27.34 | 32.46 | 32.90 | 29.20 | 32.51 | 32.49 | 31.15 |
| Sparse COLMAP point cloud input | | | | | | | |
| STG | 27.71 | 31.83 | 31.41 | 28.06 | 32.17 | 32.67 | 30.64 |
| 4DGS | 26.51 | 32.11 | 31.74 | 26.93 | 31.44 | 32.42 | 30.19 |
| 4DGaussians | 26.69 | 31.89 | 25.88 | 27.54 | 28.07 | 31.73 | 28.63 |
| 3DGStream | 27.75 | 33.31 | 33.21 | 28.42 | 34.30 | 33.01 | 31.67 |
| Ours | 29.04 | 32.46 | 33.21 | 29.56 | 33.23 | 33.84 | 31.89 |

Table B. Quantitative evaluation on the N3V dataset using SSIM_1.

| SSIM_1 | Coffee Martini | Cook Spinach | Cut Roasted Beef | Flame Salmon | Flame Steak | Sear Steak | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NeRFPlayer | 0.951 | 0.929 | 0.908 | 0.940 | 0.950 | 0.908 | 0.931 |
| HyperReel | 0.892 | 0.941 | 0.945 | 0.882 | 0.949 | 0.952 | 0.927 |
| Dense COLMAP point cloud input | | | | | | | |
| STG | 0.916 | 0.952 | 0.954 | 0.918 | 0.960 | 0.961 | 0.944 |
| 4DGS | N/A | N/A | 0.980 | 0.960 | N/A | N/A | 0.970 |
| 4DGaussians | 0.905 | 0.949 | 0.957 | 0.917 | 0.954 | 0.957 | 0.940 |
| Sparse COLMAP point cloud input | | | | | | | |
| STG | 0.904 | 0.946 | 0.946 | 0.913 | 0.954 | 0.955 | 0.936 |
| 4DGS | 0.902 | 0.948 | 0.947 | 0.904 | 0.954 | 0.955 | 0.935 |
| 4DGaussians | 0.893 | 0.944 | 0.913 | 0.896 | 0.946 | 0.946 | 0.923 |
| Ours | 0.915 | 0.947 | 0.948 | 0.917 | 0.956 | 0.959 | 0.940 |

Table C. Quantitative evaluation on the N3V dataset using SSIM_2.

| SSIM_2 | Coffee Martini | Cook Spinach | Cut Roasted Beef | Flame Salmon | Flame Steak | Sear Steak | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Neural Volumes | N/A | N/A | N/A | 0.876 | N/A | N/A | 0.876 |
| LLFF | N/A | N/A | N/A | 0.848 | N/A | N/A | 0.848 |
| DyNeRF | N/A | N/A | N/A | 0.960 | N/A | N/A | 0.960 |
| HexPlane | N/A | 0.983 | 0.985 | 0.980 | 0.988 | 0.986 | 0.984 |
| K-Planes | 0.953 | 0.966 | 0.966 | 0.953 | 0.970 | 0.974 | 0.964 |
| MixVoxels-L | 0.951 | 0.968 | 0.966 | 0.949 | 0.971 | 0.976 | 0.964 |
| MixVoxels-X | 0.954 | 0.968 | 0.971 | 0.953 | 0.973 | 0.976 | 0.966 |
| Im4D | N/A | N/A | 0.970 | N/A | N/A | N/A | 0.970 |
| 4K4D | N/A | N/A | 0.972 | N/A | N/A | N/A | 0.972 |
| Dense COLMAP point cloud input | | | | | | | |
| STG | 0.910 | 0.947 | 0.950 | 0.913 | 0.956 | 0.958 | 0.939 |
| 4DGS | N/A | N/A | N/A | N/A | N/A | N/A | 0.972 |
| Sparse COLMAP point cloud input | | | | | | | |
| STG | 0.898 | 0.940 | 0.939 | 0.907 | 0.949 | 0.950 | 0.931 |
| 4DGS | 0.894 | 0.944 | 0.943 | 0.896 | 0.951 | 0.951 | 0.930 |
| 4DGaussians | 0.886 | 0.939 | 0.907 | 0.889 | 0.942 | 0.942 | 0.917 |
| Ours | 0.922 | 0.951 | 0.953 | 0.923 | 0.960 | 0.963 | 0.945 |

Table D. Quantitative evaluation on the N3V dataset using LPIPS.

| LPIPS | Coffee Martini | Cook Spinach | Cut Roasted Beef | Flame Salmon | Flame Steak | Sear Steak | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NeRFPlayer | 0.085 | 0.113 | 0.144 | 0.098 | 0.088 | 0.138 | 0.111 |
| HyperReel | 0.127 | 0.089 | 0.084 | 0.136 | 0.078 | 0.077 | 0.096 |
| Neural Volumes | N/A | N/A | N/A | 0.295 | N/A | N/A | 0.295 |
| LLFF | N/A | N/A | N/A | 0.235 | N/A | N/A | 0.235 |
| DyNeRF | N/A | N/A | N/A | 0.083 | N/A | N/A | 0.083 |
| HexPlane | N/A | 0.098 | | | | | 0.098 |
| K-Planes | 0.024 | 0.017 | 0.017 | 0.024 | 0.015 | 0.013 | 0.018 |
| MixVoxels-L | 0.106 | 0.099 | 0.088 | 0.116 | 0.088 | 0.080 | 0.096 |
| MixVoxels-X | 0.081 | 0.062 | 0.057 | 0.078 | 0.051 | 0.053 | 0.064 |
| Dense COLMAP point cloud input | | | | | | | |
| STG | 0.069 | 0.043 | 0.042 | 0.063 | 0.034 | 0.033 | 0.047 |
| 4DGS | N/A | N/A | 0.041 | N/A | N/A | N/A | 0.055 |
| Sparse COLMAP point cloud input | | | | | | | |
| STG | 0.087 | 0.056 | 0.060 | 0.074 | 0.046 | 0.046 | 0.062 |
| 4DGS | 0.079 | 0.041 | 0.041 | 0.078 | 0.036 | 0.037 | 0.052 |
| 4DGaussians | 0.095 | 0.056 | 0.104 | 0.095 | 0.050 | 0.046 | 0.074 |
| Ours | 0.068 | 0.049 | 0.048 | 0.075 | 0.035 | 0.036 | 0.052 |
Author Response

[W1, Q1] Thank you for letting us know about these relevant papers; we believe that additional comparison with them makes this paper more solid. We have thoroughly reviewed these models and will include them in the revised version. You can find the updated results in Table A, where we report the available values. For 3DGStream [1], we directly refer to the results from their paper. Similarly, for 4K4D [2] and Im4D [3], the results are taken from the STG paper [4].

These works showcase impressive results and offer valuable insights that we deeply respect. However, our approach has certain advantages over theirs. First, 3DGStream [1] does not support random access in time, which means you cannot directly access the scene at an arbitrary timestamp. Moreover, 3DGStream requires a trained 3DGS model for initialization. Second, both 4K4D [2] and Im4D [3] require background scenes for training. Unfortunately, the N3V dataset does not provide background scenes; our reproduction of Im4D reaches 30.34 dB PSNR.

[W2] Thanks for pointing this out. We attach SSIM and LPIPS in Table B, Table C and Table D. Our results include the models mentioned in W1. We will add them to the revised version. Note that there are two implementations of the SSIM metric: SSIM_1 is the structural similarity function from the scikit-image library, and SSIM_2 is the implementation borrowed from the 3DGS codebase.
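
As a side note on SSIM_1, a minimal reproduction sketch with scikit-image is shown below (assuming scikit-image >= 0.19 for the channel_axis argument; the images here are random arrays purely for demonstration). The SSIM_2 variant used in 3DGS-style codebases is typically a windowed PyTorch implementation and is not reproduced here, which is exactly why the two numbers can differ.

```python
# Demonstration of the SSIM_1 computation via scikit-image. Real evaluations would
# load rendered and ground-truth frames; random arrays are used here only to show
# the call. Assumes scikit-image >= 0.19 (channel_axis keyword).
import numpy as np
from skimage.metrics import structural_similarity

rendered = np.random.rand(256, 256, 3).astype(np.float32)
ground_truth = np.clip(rendered + 0.05 * np.random.randn(256, 256, 3), 0.0, 1.0).astype(np.float32)

ssim_1 = structural_similarity(ground_truth, rendered, channel_axis=-1, data_range=1.0)
print(f"SSIM_1 (scikit-image): {ssim_1:.4f}")
```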

[W3] Other NeRF baselines may have comparable PSNR, but they have drawbacks in rendering/training time or model size. It can also be seen that there is a larger gain under sparse initial COLMAP point clouds, compared to the other Gaussian Splatting baselines.

[Q2] While the final rendering quality depends on the keyframe selection, it is also related to efficiency. Shorter keyframe intervals improve rendering quality because they allow more complex motions to be approximated, but once the keyframe interval is small enough, reducing it further tends to overfit and requires a larger storage size. Therefore, we need to select an appropriately large keyframe interval. We report the numerical evaluation of FEDGS with respect to the keyframe interval in Table E.

We note that the empirical setting can be varied according to properties of datasets. We promise that the evaluation result will be included in the revised version of this paper.

[4] Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. Spacetime gaussian feature splatting for real-time dynamic view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

Comment

Thanks for your replies.

Although the paper has shown some strengths and the rebuttal has answered some of my concerns, it still misses some important baselines, as I mentioned, and its overall performance on the different metrics (PSNR, LPIPS, and SSIM) is not SOTA. Therefore, I am inclined to maintain my score.

I strongly suggest the authors include more baselines and metrics in the revised paper.

Comment

We are sorry that our response did not fully address your concern. However, we note that we did try to reproduce the comparison baselines you suggested. Specifically, both 4K4D and Im4D require trained backgrounds of scenes, but these are not provided in their official source code, which makes it impossible for us to run them. For 3DGStream, we also used the authors' provided code but were unable to reproduce the performance reported in the original paper: we only obtain 32.55 dB PSNR on the Cook Spinach scene, which is lower than the reported 33.31 dB. We are contacting the authors to investigate these issues and will make this clear in the revised version.

Comment

Thanks for your reply.

I can understand why some baselines are missing for now. Another important issue is that even without comparing with the baselines I suggested, the proposed methods can't always be SOTA on every metric. For example, even though K-Planes has a slightly lower PSNR, its performance is much better than the proposed methods regarding LPIPS and SSIM_2.

Comment

We sincerely apologize: we made a mistake when directly referring to the quantitative LPIPS metrics from the STG paper (the scene-specific LPIPS for K-Planes is not reported). Please check Table 11 of the STG paper. The original metrics that we intended to refer to are below:

| Method | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| K-Planes | 31.63 | 0.964 | N/A |
| STG | 31.48 | 0.939 | 0.047 |
| Ours | 31.89 | 0.945 | 0.052 |

As shown in the table above, we would like to highlight that K-Planes does not show SoTA results. That is, with respect to PSNR, our method achieves SoTA results, whereas STG and K-Planes show the best results in terms of LPIPS and SSIM, respectively. At this point, we agree with your opinion that our method does not produce the best result on every evaluation metric.

In addition, we would like to describe the root of this problem. 4D Gaussians and K-Planes report per-scene breakdowns only for PSNR and SSIM, and 3DGStream and 4DGS only show PSNR. As discussed in Appendix C.3 of HyperReel, there is no single standard implementation of SSIM in use; for example, the work in [1] uses its own implementation of SSIM rather than a library function. Therefore, to collect all the PSNR, SSIM and LPIPS values of the comparison methods, we had to refer to multiple papers, which led to the mistake.

Lastly, we clearly indicate references of the PSNR, SSIM and LPIPS values used in this rebuttal below (Method - Reference):

  • PSNR
    NeRFPlayer, HyperReel, Neural Volumes, LLFF, DyNeRF, HexPlane, K-Planes, MixVoxels-L, MixVoxels-X - Table 11 of STG paper
    Im4D, 4K4D - Table 6 of STG paper
    4DGS - Table 4 of 4DGS paper
    4DGaussians - Table 6 of 4DGaussians paper
    3DGStream - Table 5 of the 3DGStream paper

  • SSIM_1
    NeRFPlayer, HyperReel - Table 11 of STG paper
    4DGS - Table 3 of 4DGS paper
    4DGaussians - Table 6 of 4DGaussians paper

  • SSIM_2
    Neural Volumes, LLFF, DyNeRF, HexPlane, K-Planes, MixVoxels-L, MixVoxels-X - Table 11 of STG paper
    Im4D, 4K4D - Table 6 of STG paper

  • LPIPS
    NeRFPlayer, HyperReel, Neural Volumes, LLFF, DyNeRF, HexPlane, MixVoxels-L, MixVoxels-X - Table 11 of STG paper
    Im4D, 4K4D - Table 6 of STG paper
    4DGS - A rebuttal experiment for “Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting” in OpenReview of ICLR’24

[1] Wang, L, et al. "Fourier plenoctrees for dynamic radiance field rendering in real-time". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

Author Response

First, we would like to thank all the reviewers for their valuable opinions on our paper. All reviewers agree that FEDGS is comparable in performance and efficiency to other models. Reviewers Tqxz and Vpue comment on the clarity of the paper. Reviewers Tqxz, Vpue, and MZ9L highlight the novelty of our fully explicit model. Reviewers Vpue and MZ9L point out that keyframe-based interpolation is interesting and that treating static and dynamic parts separately is a good idea. In response to the reviewers' questions, we further analyze our model in various situations, including replacing the dynamic part, reappearing objects, occluded objects, objects appearing from nothing, and extremely long durations. We have uploaded these analyses and the additional results supporting them as a PDF file. We hope all the reviewers will check them carefully and give us further comments.

Final Decision

The paper is on creating and reconstructing a dynamic 3DGS model. The idea is interesting but several important weaknesses were initially raised by the reviewers. The discussion clarified most of these weaknesses and an accept consensus was reached. The authors are encouraged to include the contents of the discussion in the revised paper, in particular the additional results provided.