Grid4D: 4D Decomposed Hash Encoding for High-Fidelity Dynamic Gaussian Splatting
This paper proposes Grid4D, a novel high-fidelity dynamic scene rendering model with 4D decomposed hash encoding.
Abstract
Reviews and Discussion
This paper proposes using Hash Encoding to model the Deformation Field for dynamic scenes. The authors first decompose 4D encoding into four 3D encodings to avoid the losses caused by the low-rank tensor assumption. They also introduce an attention module to decouple spatial and temporal features. Since explicit modeling can result in insufficiently smooth delta predictions for the deformation field, the authors incorporate a smooth loss to strongly regularize the outputs of Hash Encoding. Experiments on both synthetic and real-world datasets show substantial quality improvements compared to prior state-of-the-art methods. In addition, Grid4D shows a significant FPS improvement compared to Deformable-GS (without lightweight).
Strengths
I think the main strength of this paper is the authors' deep understanding of deformable-based Gaussian splatting. To be compatible with the densification of vanilla Gaussian splatting, the deformation field needs to output deltas (position or rotation). The key to the success of outputting deltas lies in the sufficiently smooth output of the deformation field. This is why Deformable-GS [1] can outperform D-4DGS [2] significantly on datasets with accurate camera poses, since HexPlane is not as smooth as an MLP. In addition, the low-rank tensor assumption behind HexPlane not holding in dynamic scenes is also a potential reason. The authors insightfully addressed the issue of MLPs being overly smooth and explicit methods being insufficiently smooth. Therefore, they used hash encoding, which does not require the low-rank tensor assumption, to ensure high-frequency details, while employing a smooth loss to preserve the foundation of deformable-based Gaussian splatting.
The design of the deformation network and the experiments with strong baselines are compelling. Ablations are also appreciated.
[1] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101, 2023.
[2] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023.
Weaknesses
- low-rank assumption -> low-rank tensor assumption is more accurate.
- I think the following papers should also be cited:
- Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis by Zhan Li et al.
- 4D Gaussian Splatting: Towards Efficient Novel View Synthesis for Dynamic Scenes by Yuanxing Duan et al.
- Datasets:
  - D-NeRF: I think the 'Lego' scene (wrong scene with inconsistent training and test set) should not appear in the table, or it should use the Deformable-GS [1] setup with the validation set as the test data. This can prevent misleading subsequent research.
  - HyperNeRF: I do not find the results on HyperNeRF convincing because the camera poses in the HyperNeRF dataset are inaccurate. For example, in the '3D Printer' scene (Fig. 5), Grid4D is visibly clearer than TiNeuVox, but the metrics do not reflect this. Therefore, I suggest that the authors adopt the NeRF-DS [3] dataset, which has more accurate camera poses, for real-world scene comparisons.
- More difficult motion: the motions in the datasets used for experiments are mostly rigid or easy to explain. It would be interesting to see if the method is able to handle more difficult deformations, like non-rigid or large motion. Based on my understanding of the Deformation Field, it is challenging for it to handle large motions effectively.
- More comparisons: I wonder if the authors have considered SC-GS [4]. It could be better if the authors could compare their method with SC-GS. I believe the approach of SC-GS could be applied to Grid4D to achieve higher rendering quality and FPS.
[3] Zhiwen Yan, Chen Li, and Gim Hee Lee. Nerf-ds: Neural radiance fields for dynamic specular objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8285–8295, 2023.
[4] Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Sc-gs: Sparse controlled gaussian splatting for editable dynamic scenes. arXiv preprint arXiv:2312.14937, 2023.
Questions
A minor question: Lines 227-231 state that different scenes use different hyperparameters. I would like to know how significantly these different hyperparameters affect the scenes, as this greatly impacts the method's versatility.
Limitations
Please refer to the Weakness section.
Thank you very much for the constructive feedback. We hope that our response below will address your concerns.
Q1: Citation of the two related works and the writing issue.
A1: We will add citations of the mentioned related works and revise the wording in the final version.
Q2: Should the metrics of the wrong scene 'Lego' be removed?
A2: We will remove the metrics of the 'Lego' scene in the final version.
Q3: Should the NeRF-DS dataset, which has more accurate camera poses than HyperNeRF, be used for real-world comparisons?
A3: According to the original paper, the NeRF-DS dataset is designed for dynamic scenes with specular objects, which makes it less suitable for evaluating general dynamic scene rendering. Therefore, to further demonstrate the improvements of Grid4D, we also choose the Neu3D dataset. Neu3D is captured by 21 cameras with fixed poses and has more accurate camera poses than HyperNeRF. Additionally, the scenes in the Neu3D dataset are larger and more challenging.
We report the comparison results with PSNR on the Neu3D [1] dataset in the following table; our model performs better than the state-of-the-art models. The qualitative results can be found in Figure 1 of the top comment PDF. We will add these experiments to our paper in the final version.
| Model | Coffee Martini | Cook Spinach | Cut Beef | Flame Salmon | Flame Steak | Sear Steak | Mean |
|---|---|---|---|---|---|---|---|
| 4D-GS | 27.34 | 32.50 | 32.26 | 27.99 | 32.54 | 33.44 | 31.01 |
| Grid4D(Ours) | 28.30 | 32.58 | 33.22 | 29.12 | 32.56 | 33.16 | 31.49 |
Q4: How does Grid4D perform when handling difficult motions?
A4: The 'Hand', 'Peel Banana', 'Chocolate', 'Broom', and 'Teapot' scenes in the HyperNeRF dataset have large and complex motions. As listed in Table 5, Figure 9, and Figure 10 of our paper, our method achieves much better rendering quality than the state-of-the-art models due to the discriminability improvement of explicit features. However, Grid4D still shows several artifacts in the 'Broom' and 'Teapot' scenes, which might need further research.
Q5: Comparison to SC-GS.
A5: We conduct the comparison with SC-GS on the D-NeRF dataset. The comparison results with PSNR can be found in the following table, and the qualitative results can be found in Figure 2 of the top comment PDF. Notably, the original SC-GS rendering resolution is lower than ours, so we changed the rendering resolution of SC-GS to match ours for fairness. Additionally, we remove the 'Lego' scene because of the incorrect test ground truth. We will add these experiments in the final version.
| Model | Bouncing Balls | Hell Warrior | Hook | Jumping Jacks | Mutant | Standup | Trex | Mean |
|---|---|---|---|---|---|---|---|---|
| SC-GS | 41.59 | 42.19 | 38.79 | 39.34 | 43.43 | 46.72 | 39.53 | 41.65 |
| Grid4D(Ours) | 42.62 | 42.85 | 38.89 | 39.37 | 43.94 | 46.28 | 40.01 | 42.00 |
According to the results, Grid4D has better performance than SC-GS on average. The improvement of Grid4D might come from the performance gap between implicit deformation fields (which SC-GS is based on) and Grid4D. SC-GS uses a fully MLP-based implicit model to predict deformations, which is inherently over-smooth. Our proposed 4D decomposed hash encoder generates explicit features with high discriminability to represent the deformations more accurately.
We consider SC-GS an excellent work mainly focused on dynamic scene editing and rendering. Therefore, combining the advantages of SC-GS and Grid4D, or applying Grid4D to SC-GS, is a promising future direction that might yield better rendering quality.
Q6: What hyperparameters are set differently across scenes?
A6: In our experiments, the hyperparameters that need to change mainly include the resolution of the time dimension, the weight of the smooth regularization term, and the perturbation range.
The resolution of the time dimension is set according to the frame number, usually between a quarter and a half of the frame number, following the Nyquist-Shannon sampling theorem. The resolution can be set automatically by the algorithm.
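A simple illustrative rule consistent with this description (the helper below is hypothetical, not the exact code of our implementation):

```python
# Hedged sketch: choose the temporal hash-grid resolution from the frame count,
# following the Nyquist-Shannon argument above (a quarter to a half of the frames).
def temporal_resolution(num_frames: int, factor: float = 0.5) -> int:
    assert 0.25 <= factor <= 0.5, "factor should lie between a quarter and a half"
    return max(2, round(num_frames * factor))
```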
The weight of the smooth regularization term uses one value for most scenes and a different value for the 'vrig' part of the HyperNeRF dataset. In most cases, the default weight obtains reliable results. For better performance, increasing the weight helps Grid4D render better when the scene primarily consists of rigid objects and simple motions. Similarly, decreasing the weight helps Grid4D model complex motions and non-rigid objects.
For the D-NeRF dataset and the 'vrig' part of the HyperNeRF dataset, we set one perturbation range; for the 'interp' part of the HyperNeRF dataset, we set another. In most cases, the default range obtains reliable results.
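Assuming the perturbation range refers to random offsets applied to the deformation network inputs for the smooth regularization, a simplified sketch of such a regularizer could look like the following (function and variable names are illustrative; the exact loss in the paper may differ):

```python
import torch

def smooth_reg(deform_fn, xyz, t, perturb_range: float, weight: float):
    """xyz: (N, 3) canonical Gaussian centers; t: (N, 1) timestamps."""
    # random offsets inside the perturbation range
    noise = (torch.rand_like(xyz) * 2.0 - 1.0) * perturb_range
    d_orig = deform_fn(xyz, t)          # predicted deltas at the original positions
    d_pert = deform_fn(xyz + noise, t)  # predicted deltas at the perturbed positions
    # penalize the difference so neighbouring positions deform consistently
    return weight * (d_orig - d_pert).abs().mean()
```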
[1] Li, Tianye, et al. "Neural 3d video synthesis from multi-view video." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
Thanks for the authors' response and the additional experimental results based on the review. I will not dwell too much on the authors' measurements of SC-GS rendering metrics. Overall, the rendering quality of SC-GS and Grid4D is similar (please refer to the homepage of MaGS: Reconstructing and Simulating Dynamic 3D Objects with Mesh-adsorbed Gaussian Splatting).
However, the authors have underestimated the FPS and training time of SC-GS. For instance, the FPS measurement of SC-GS should fix the KNN rather than querying it in every iteration, achieving nearly 300 FPS on D-NeRF at 400x400 resolution. The training time is about 40-50 minutes, far less than 75 minutes. Moreover, I believe 4DGS (FDU, real-time 4DGS) should be compared on the Neu3D dataset.
Considering my personal concerns about the future of deformation-based Gaussian splatting, the per-scene optimized hyperparameters by the authors, and the incremental (or lack of) improvement compared to SC-GS, I have decided to lower my score to weak accept.
Thank you very much for the constructive feedback. We hope that our response below will address your concerns.
Q1: Further Comparison to SC-GS.
Although our model shows no obvious improvements over SC-GS in several scenes of the D-NeRF dataset, it still significantly outperforms the state-of-the-art models on the HyperNeRF dataset, as shown in Table 3 of our paper. According to Section 7 of the SC-GS paper, SC-GS might fail to reconstruct dynamic scenes with imprecise camera poses, while our model is more robust to this problem.
We apply SC-GS to the real-world HyperNeRF dataset, using the same rendering resolution for both models for fairness. On several HyperNeRF scenes that SC-GS reconstructs successfully, SC-GS obtains 25.58 PSNR on the 'Chicken (interp)' scene while Grid4D (Ours) obtains 27.31 PSNR; SC-GS obtains 24.00 PSNR on the 'Slice Banana' scene while Grid4D (Ours) obtains 25.79 PSNR. The results demonstrate the robustness of our model when facing imprecise camera poses.
We think that the reconstruction failures and degradation of SC-GS might be due to the inherently over-smooth property of implicit MLP-based models, as mentioned in Section D of the supplementary material of DeformGS.
Q2: Are the FPS and training time of SC-GS underestimated?
In all our experiments on the D-NeRF dataset, including training and rendering, we use a resolution higher than that of the original SC-GS for fairness. Therefore, the measured FPS is lower and the training time is longer.
We think that only caching the KNN results of SC-GS from the rendering of the first frame might be slightly unfair to Grid4D, because we can perform a similar caching operation in our model by caching the directional attention scores. With these caching operations, we obtain 234 FPS for SC-GS and 200 FPS for our model.
Thanks so much for the authors' response. However, my concerns about deformation-based Gaussian splatting have not been fully addressed.
- The authors claim that Grid4D can better handle scenes with inaccurate camera poses, but as shown in Fig. 5 and Fig. 7, Grid4D still appears blurry in many details. Moreover, none of the qualitative comparisons include HyperNeRF, even though HyperNeRF typically produces clearer results than GS-based methods on the HyperNeRF dataset.
- Since quantitative metrics like PSNR heavily penalize slight misalignments more than blurriness, HyperNeRF may not show a significant advantage in these metrics. Therefore, using rendering metrics in scenarios with inaccurate camera poses is not convincing.
- Explicit representations, such as 3D-GS, are inherently very sensitive to inaccurate camera poses, and the strong regularization in deformation-based Gaussian splatting exacerbates this issue. I am confident that using spacetime-GS and 4D-GS (ICLR 24) would result in better visual clarity in HyperNeRF scenes, as they decouple the time feature to each GS.
- I believe that HyperNeRF's motion is not intense, long-term, nor complicated. Additionally, deformation-based Gaussian splatting struggles to handle the sudden appearance of objects, which can be better managed by 4D-GS and spacetime-GS.
Thank you very much for the time and the constructive feedback. We hope that our response below will address your concerns.
Q1: Is PSNR reliable when facing imprecise camera poses?
Although PSNR might not be reliable to evaluate the rendering of the details when the camera poses are imprecise, we visualize a lot of rendering results in Figure 9 and Figure 11 of our paper to demonstrate the improvements of our model on the HyperNeRF dataset. We will add the qualitative comparison to SC-GS in the final version. However, our model still has several artifacts when facing such imprecise camera poses, which might need further research.
Q2: A discussion of deformation-based models.
Fully explicit methods like Spacetime-GS are excellent models for reconstructing dynamic scenes, achieving high quality and fast rendering. However, we think that fully explicit models and deformation-based methods (like 4DGS, DeformGS, SC-GS, and our Grid4D) might have their own advantages in different cases.
Deformation-based models might be more suitable for scenes captured by few cameras, which have limited viewpoints in each frame to observe the motions. The smoothness of the deformation network makes neighbouring Gaussians undergo similar deformations, which can act as a prior that helps the model better fit the motions. In fully explicit models, however, the decoupled per-Gaussian features might struggle to maintain such spatial smoothness, resulting in artifacts. Therefore, we think that deformation-based models might perform better on monocular datasets such as D-NeRF.
Fully explicit models might be more suitable for scenes captured by multiple cameras, which have enough viewpoints in each frame to observe the motions. The decoupled features representing the deformations are very flexible and could model the motions more accurately, whereas in deformation-based models the smoothness of the neural networks might hinder accurate fitting. Therefore, we think that fully explicit models might perform better on datasets captured by multiple cameras, such as Google Immersive.
Our model mainly addresses the over-smoothness of implicit deformation networks and improves the discriminability of the encoded features for better motion fitting. Therefore, we think our model might outperform other deformation-based models in most cases. Compared to fully explicit models, however, our model could perform better on datasets captured by few cameras, but might obtain slightly lower rendering quality on datasets captured by multiple cameras. These discussions still need further research.
Dear authors and reviewer T422, I would like to join the discussion and kindly defend the deformation-based GS:
Recent works, such as "Shape of Motion" (Wang et al., ArXiv24) and "MoSCA" (Lei et al., ArXiv24), seem to excel in monocular setups through explicit deformation models. I believe these methods could handle HyperNeRF's setups. Therefore, I don't think 4DGS (Yang et al.) and Spacetime GS are similar methods. However, as the authors mentioned, the implicit global deformation network also has its advantages. Generalizability could be another considerable potential advantage.
Exploring implicit representations is an important research topic in 3D vision. While NeRF has its own advantages compared to 3D-GS, even though most researchers currently prefer the latter, we must remember that CNNs eventually replaced hand-crafted convolutional kernels.
This paper proposes a Grid4D representation for dynamic scene rendering. It breaks the low-rank assumption of 4D-GS and proposes a decomposed 4D hash-grid representation for encoding canonical 3D Gaussians. An attention module is used to aggregate the spatial-temporal features of 3D Gaussians. Additional training strategies are applied in Grid4D to ensure efficient training, and the rendering quality is better than that of 4D-GS and Deformable-GS.
Strengths
- The paper proposes a deformable representation for dynamic scene rendering without relying on the low-rank assumption.
- The attention module looks reasonable for feature aggregation.
- A smooth training strategy is employed to ensure the smoothness of Grid4D.
Weaknesses
- Why don't you compare with [13]? SC-GS currently holds the state-of-the-art (SOTA) rendering quality on D-NeRF.
- If you choose 4D-GS as your baseline, please provide the complete results on the Neu3D dataset and all other metrics (average training time, average rendering speed, average storage consumption).
- Only selecting some figures and reporting PSNR/LPIPS/SSIM for comparison is not quite enough in 3D vision. The reviewer recommends that the authors submit more comparison videos as supplementary material.
- If the authors want to propose a method that can replace the deform-based dynamic Gaussians (4D-GS, SC-GS, deform-GS) and become a new baseline, they should provide strong, comprehensive results in the experiment section.
- Additionally, the D-NeRF dataset is just a toy. I don't think much effort should be spent on this dataset after the publication of SC-GS (PSNR already up to 40). If the authors are still interested in dynamic scene novel view rendering, I recommend focusing on monocular setups (DyCheck, NeurIPS 2022) or multi-view setups (Neu3D, CVPR 2022).
- The figure looks unattractive.
Questions
- I want the authors to demonstrate that Grid4D is better than 4D-GS, Deformable-GS, and SC-GS in most evaluation metrics. Please use average results instead of the range 0.3~1.5 in Table 7, as that range is too large. If some metrics are worse, please provide more discussions.
- Solving a low-rank problem to achieve better rendering quality in the D-NeRF/NeRF-DS/HyperNeRF dataset is indeed a contribution. However, the reviewer believes that the core challenges of novel view synthesis in this dataset have already been solved. There are many important topics in dynamic scene novel view rendering, such as large motion and monocular novel view synthesis.
Limitations
As mentioned in the Weaknesses and Questions sections.
Thank you very much for the constructive feedback. We hope that our response below will address your concerns.
Q1: Comparison to SC-GS.
A1: We conduct the comparison with SC-GS on the D-NeRF dataset. The comparison results with PSNR can be found in the following table, and the qualitative results can be found in Figure 2 of the top comment PDF. Notably, the original SC-GS rendering resolution is lower than ours, so we changed the rendering resolution of SC-GS to match ours for fairness. Additionally, we remove the 'Lego' scene because of the incorrect test ground truth. We will add these experiments in the final version.
| Model | Bouncing Balls | Hell Warrior | Hook | Jumping Jacks | Mutant | Standup | Trex | Mean |
|---|---|---|---|---|---|---|---|---|
| SC-GS | 41.59 | 42.19 | 38.79 | 39.34 | 43.43 | 46.72 | 39.53 | 41.65 |
| Grid4D(Ours) | 42.62 | 42.85 | 38.89 | 39.37 | 43.94 | 46.28 | 40.01 | 42.00 |
According to the results, Grid4D has better performance than SC-GS on average. The improvement of Grid4D might come from the performance gap between implicit deformation fields (which SC-GS is based on) and Grid4D. SC-GS uses a fully MLP-based implicit model to predict deformations, which is inherently over-smooth. Our proposed 4D decomposed hash encoder generates explicit features with high discriminability to represent the deformations more accurately.
We think that SC-GS is an excellent work mainly focused on dynamic scene editing and rendering. Therefore, combining the advantages of SC-GS and Grid4D, or applying Grid4D to SC-GS, is a promising future direction that might yield better rendering quality.
Q2: Results on Neu3D dataset.
A2: We conduct the experiments on the Neu3D dataset and report PSNR in the following table. From this table, one can see that our model has better performance than the state-of-the-art model. The qualitative results can be found in Figure 1 of the top comment PDF. We will add these experiments to our paper in the final version.
| Model | Coffee Martini | Cook Spinach | Cut Beef | Flame Salmon | Flame Steak | Sear Steak | Mean |
|---|---|---|---|---|---|---|---|
| 4D-GS | 27.34 | 32.50 | 32.26 | 27.99 | 32.54 | 33.44 | 31.01 |
| Grid4D(Ours) | 28.30 | 32.58 | 33.22 | 29.12 | 32.56 | 33.16 | 31.49 |
Q3: Can the authors provide comparison videos?
A3: To demonstrate the effectiveness of our model, we provide additional rendering results in Figure 12, Figure 11, and Figure 9 of our supplementary material. To demonstrate the temporal coherence of our model, we provide video-like sequences by displaying sampled frames in Figure 3 of the top comment PDF. We uniformly interpolate the time, randomly select a camera pose to render the scene reconstructed by Grid4D, and display every 5th frame (out of 150 frames in total). We will provide more videos in our GitHub repository in the final version.
Q4: Can the authors provide experimental results on other datasets such as Neu3D, since D-NeRF is a toy dataset?
A4: To further evaluate our model, we also conduct experimental comparisons on the real-world HyperNeRF and Neu3D datasets. The results demonstrate the effectiveness of our model on the real-world datasets.
Q5: Use average results of the time and space consumption instead of a range.
A5: We provide the detailed average metrics in the following table. We will add the table to our paper in the final version.
| Model | Training Time | GPU Memory | FPS | PSNR |
|---|---|---|---|---|
| 4D-GS | 20min | 1GB | 160 | 34.11 |
| DeformGS | 33min | 4.5GB | 62 | 38.26 |
| SC-GS | 75min | 3.1GB | 179 | 39.56 |
| Grid4D(Ours) | 55min | 4.0GB | 153 | 39.86 |
Thank you for your comprehensive answer. Most of my problems have been solved. I would like to update my final score to borderline acceptance.
The paper presents a grid-based method to compute deformed Gaussians to render dynamic scenes from input images. It proposes to perform a 3D decomposition of the 4D input, using multiresolution hash encoding to access spatial (static) and temporal (dynamic) feature vectors. The static features are fed to an MLP responsible for computing a directional attention score, which is composed with temporal features coming from another MLP, resulting in the deformation features. Those features pass through a multi-head decoder, which infers the deformation parameters. Those parameters are used to deform the set of canonical Gaussians, which are rasterized to generate the final image.
Strengths
(1) Contextualization is very good, cited works are sufficient. The Introduction and Related Work sections may be successfully used for familiarization with the fields needed for proper understanding of the paper.
(2) Every part of the method is intuitively presented and the Math holds.
(3) Mathematical notation is very clean.
(4) I liked the introduction of an attention mechanism in this context and the idea behind the smooth regularization.
Weaknesses
(1) I am not convinced about the temporal smoothness and coherence of the proposed method. The interpolations needed for grid-based methods are known to introduce discontinuities even in static cases (NGLOD, Instant-NGP). The paper does not present a supplementary video nor a temporal coherence metric (such as FID-VID or FVD). Moreover, all presented experiments are based on specific frames of the datasets. It is not currently possible to qualitatively or quantitatively evaluate this aspect of the proposed approach, which is crucial for a 4D method. This is the core reason behind my current rating.
Since the authors are not allowed to add a supplementary video nor include a link in the rebuttal (as far as I know), I believe a way to address this criticism with data would be to show a table with temporal coherence metrics. However, I encourage the authors to present other arguments if they find creative ways to be convincing. I liked the paper and I am willing to increase my rating in case the authors are convincing on this subject.
Questions
(1) Small presentation advice: at the introduction, it would be better to talk about Figure 1 before Figure 2, because Figure 1 is very good as a primary description of how the 4D encoding is decomposed. Starting with Figure 2 is a little bit more confusing.
(2) I think the Preliminaries section should include a small introduction to multiresolution hash encoding too. It would help the reader to have better context for the remainder of the method section.
(3) I would like to know the values of the hyperparameters.
Limitations
Yes, the limitations are presented in the supplementary.
Thank you very much for the constructive feedback. We hope that our response below will address your concerns.
Q1: How can the temporal coherence of the rendering results be demonstrated?
A1: To demonstrate the temporal coherence of our model, we provide video-like sequences by displaying sampled frames in Figure 3 of the top comment PDF. We uniformly interpolate the time, randomly select a camera pose to render the scene reconstructed by Grid4D, and display every 5th frame (out of 150 frames in total). We will provide more videos in our GitHub repository in the final version.
In the test stage of the dynamic scene rendering task, the camera poses and timestamps are usually not selected along a video sequence. To our knowledge, there is no existing metric to evaluate the temporal coherence of a video without ground truth. To further demonstrate the temporal coherence of the rendering results, we attempt to design a simple metric. We use the optical flow estimation model RAFT [1] to predict the forward and backward flow between two neighbouring frames of the rendered video. Similar to the Chamfer Distance, we calculate the mean of the total pixel displacements, as in the following formula,

$$D = \frac{1}{2(T-1)HW}\sum_{t=1}^{T-1}\sum_{p}\Big(\lVert F_{t\to t+1}(p)\rVert_2 + \lVert F_{t+1\to t}(p)\rVert_2\Big),$$

where $T$, $H$, $W$ denote the number of frames, the height, and the width of the video, and $F$ denotes the corresponding flow. The unit of $D$ is pixels. A lower value of $D$ indicates better temporal coherence of the video.
We use the official pretrained RAFT [1] model of PyTorch to estimate the flow. We uniformly interpolate 150 timestamps along the timeline and randomly select a camera pose for our model to render the video. The results on the D-NeRF dataset are listed in the following table. For reference, we randomly select a video from the Kinetics-400 [2] dataset and calculate the same metric $D$.
| Bouncing Balls | Hell Warrior | Hook | Jumping Jacks | Lego | Mutant | Standup | Trex | Reference |
|---|---|---|---|---|---|---|---|---|
| 0.248 | 0.673 | 1.014 | 1.799 | 0.220 | 0.324 | 1.331 | 0.576 | 2.137 |
According to the results, the average pixel displacements of the rendered videos are all below 2.0 pixels, which demonstrates the temporal coherence of our rendering results. However, the metric proposed above is not well explored and might need further research.
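For reference, a minimal sketch of how this metric can be computed with the pretrained RAFT model from torchvision (batching and resolution handling are simplified; this is illustrative rather than our exact evaluation script):

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

@torch.no_grad()
def flow_coherence(frames: torch.Tensor) -> float:
    """frames: (T, 3, H, W) float tensor in [0, 1]; H and W must be divisible by 8."""
    weights = Raft_Large_Weights.DEFAULT
    model = raft_large(weights=weights).eval()   # pretrained RAFT from torchvision
    preprocess = weights.transforms()            # normalizes frame pairs to RAFT's input range

    total = 0.0
    for t in range(frames.shape[0] - 1):
        a, b = frames[t : t + 1], frames[t + 1 : t + 2]
        fa, fb = preprocess(a, b)
        ba, bb = preprocess(b, a)
        fwd = model(fa, fb)[-1]                  # (1, 2, H, W), last refinement iteration
        bwd = model(ba, bb)[-1]
        # mean L2 pixel displacement, averaged over forward and backward flow
        total += 0.5 * (fwd.norm(dim=1).mean() + bwd.norm(dim=1).mean()).item()
    return total / (frames.shape[0] - 1)
```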
Q2: What are the values of the hyperparameters?
A2: In our experiments, one of the hyperparameters is fixed across all scenes, while the other uses one value for most scenes and a different value for the 'vrig' part of the HyperNeRF dataset.
In most cases, the default settings obtain reliable results. For better performance, increasing the smooth regularization weight helps Grid4D render better when the scene primarily consists of rigid objects and simple motions. Similarly, decreasing the weight helps Grid4D model complex motions and non-rigid objects.
Q3: Writing issues.
A3: We will add the introduction of Multi-resolution Hash Encoding and the description of Figure 1 in the final version.
[1] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020:16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020.
[2] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
Thank you for considering the review and for the effort in answering it. I have no further questions. I will reflect a little on the presented data about temporal smoothness and coherence and will update my rating to reflect my conclusions.
The paper introduces Grid4D, a novel dynamic scene rendering model that leverages hash encoding for 4D input decomposition, enabling high-quality and speedy rendering of dynamic scenes. Unlike traditional plane-based methods that suffer from excessive feature overlap due to low-rank assumptions, Grid4D uses a tri-axial 4D decomposition into spatial and temporal 3D hash encodings. This approach, combined with a novel directional attention mechanism, improves the discrimination and aggregation of deformation features, enhancing both the visual quality and rendering speed of dynamic scenes.
Strengths
Grid4D's use of 4D decomposed hash encoding addresses the limitations of plane-based methods by effectively reducing feature overlap, which enhances feature discriminability and rendering accuracy. The novel attention mechanism aligns well with the diverse deformations across different scene components, allowing for more accurate deformation prediction and rendering. It is also good to demonstrate superiority over SoTA models.
Weaknesses
- Despite its advancements in rendering performance, Grid4D does not significantly improve the training speed compared to existing models.
- The sophisticated architecture involving multiple hash encodings and the directional attention mechanism might complicate the implementation and tuning of the model.
- The model's complexity and the specific tuning required for different datasets might limit its generalizability or lead to overfitting on particular types of dynamic scenes.
Questions
- How does Grid4D handle scenes with highly non-uniform motion patterns?
- Could the directional attention mechanism be adapted for use in other types of neural network architectures beyond scene rendering?
- What are the specific challenges faced when applying Grid4D to real-world datasets compared to synthetic ones?
- How does the smooth regularization impact the model's performance in terms of real-time rendering capabilities?
Limitations
- Grid4D shows limited improvement in training speed.
- It does not fully address challenges related to extremely dynamic or unpredictable environments.
- The effectiveness of the model heavily relies on the novel encoding and attention mechanisms, which might not translate as effectively to different rendering tasks or simpler scene compositions.
Thank you very much for the constructive feedback. We hope that our response below will address your concerns.
Q1: Grid4D does not significantly improve the training speed in comparison to existing models.
A1: Although our model shows no improvement in training speed, it obtains a good balance between time consumption and performance. Additionally, the GPU memory consumed by our model is comparable to that of DeformGS and SC-GS. We report the average training time, GPU memory, FPS, and PSNR on the D-NeRF dataset in the following table.
| Model | Training Time | GPU Memory | FPS | PSNR |
|---|---|---|---|---|
| 4D-GS | 20min | 1GB | 160 | 34.11 |
| DeformGS | 33min | 4.5GB | 62 | 38.26 |
| SC-GS | 75min | 3.1GB | 179 | 39.56 |
| Grid4D(Ours) | 55min | 4.0GB | 153 | 39.86 |
Q2: Will the complexity and specific tuning of Grid4D limit its generalizability?
A2: Although our model has multiple hash encodings, it has fewer reconstruction failures than DeformGS on the real-world HyperNeRF dataset, as shown in the 'Teapot' scene (last row of Figure 9 in our paper) and the 'Hand' scene (middle row of Figure 11 in our paper).
To demonstrate the versatility of our model, we also apply Grid4D to the Neu3D [1] dataset, which has larger scenes and more complex motions. We report PSNR in the following table. The qualitative results can be found in Figure 1 of the top comment PDF. From this table, one can see that although our model has multiple hash encodings, it still yields better performance. We will add these experiments to our paper in the final version.
| Model | Coffee Martini | Cook Spinach | Cut Beef | Flame Salmon | Flame Steak | Sear Steak | Mean |
|---|---|---|---|---|---|---|---|
| 4D-GS | 27.34 | 32.50 | 32.26 | 27.99 | 32.54 | 33.44 | 31.01 |
| Grid4D(Ours) | 28.30 | 32.58 | 33.22 | 29.12 | 32.56 | 33.16 | 31.49 |
The main tuning of our model is the resolution of the temporal hash encoder because different scenes have different numbers of frames. Additionally, according to the Nyquist-Shannon Sampling Theorem, the temporal resolution can be set to half or a quarter of the frame number automatically by the algorithm.
Q3: How does Grid4D handle scenes with highly non-uniform motions?
A3: Firstly, explicit representation methods are much more flexible than implicit ones, making Grid4D better suited for modeling highly non-uniform motions. Secondly, our hash encoder decomposes 4D coordinates into four 3D coordinates, as sketched below. When encoding heavily overlapping 4D inputs, the 3D decomposition results share fewer identical components. Therefore, the encoded explicit features have fewer overlapping parts, resulting in a more accurate representation of the corresponding non-uniform motions.
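As an illustration of this decomposition (the encoder objects below are hypothetical stand-ins for multi-resolution hash grids, e.g. instances from tiny-cuda-nn):

```python
# Hedged sketch: a 4D sample (x, y, z, t) is split into one spatial triple and three
# temporal triples, each encoded by its own multi-resolution hash encoder.
def encode_4d(x, y, z, t, spatial_enc, temporal_encs):
    f_spatial = spatial_enc((x, y, z))                   # static (spatial) component
    triples = [(x, y, t), (x, z, t), (y, z, t)]          # dynamic (temporal) components
    f_temporal = [enc(p) for enc, p in zip(temporal_encs, triples)]
    return f_spatial, f_temporal
```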
In our experiments, the 'Hand', 'Peel Banana', and 'Chocolate' scenes in the HyperNeRF dataset contain highly non-uniform motions. As listed in Table 5, Figure 9, and Figure 10 of our paper, our method achieves significantly better rendering quality than the state-of-the-art models.
Q4: Can the directional attention mechanism be adapted for other uses?
A4: We consider that directional attention could also be used in 4D generation models for temporal feature selection. For example, we could generate the directional attention score from the features unrelated to the timestamp, and apply the score to the other features.
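A minimal PyTorch sketch of such a module (the shapes, hidden size, and tanh activation are illustrative assumptions, not necessarily the exact design in our paper):

```python
import torch.nn as nn

class DirectionalAttention(nn.Module):
    # Spatial (time-independent) features produce one score per temporal component;
    # the scores modulate the temporal features before they are decoded.
    def __init__(self, spatial_dim: int, temporal_dim: int, hidden: int = 64):
        super().__init__()
        self.score_mlp = nn.Sequential(
            nn.Linear(spatial_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),   # one score per temporal feature group
            nn.Tanh(),              # signed ("directional") attention score
        )

    def forward(self, f_spatial, f_temporal):
        # f_spatial: (N, spatial_dim); f_temporal: (N, 3, temporal_dim)
        scores = self.score_mlp(f_spatial)          # (N, 3)
        return scores.unsqueeze(-1) * f_temporal    # (N, 3, temporal_dim)
```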
Q5: What are the specific challenges faced when applying Grid4D to real-world datasets in comparison to synthetic ones?
A5: The main challenge is that the real-world scenes usually have imprecise camera poses which might lead to the failure or degradation of rendering. Another challenge is that real-world scenes usually have more complex motions.
For the imprecise camera poses, the smooth regularization term helps the model keep consistency in neighboring regions. Such smoothness makes the model tend to fit the average results captured by the imprecise camera poses, which improves the robustness of our model. We can increase the weight of the smooth regularization term to handle more imprecise camera poses in real-world datasets.
To handle complex motions, our model improves the discriminability of the explicit features. Encoded by our 4D decomposed hash encoder, our explicit features have fewer overlapping parts, resulting in a more accurate representation of the corresponding non-uniform motions. We can increase the resolution of the 4D decomposed hash encoder to model more complex motions in real-world datasets.
Q6: How does the smooth regularization impact the real-time rendering capabilities of the model?
A6: The smooth regularization does not affect real-time rendering capabilities. During rendering, the smooth regularization term and the other losses are not computed. Smooth regularization is mainly used in the training process to improve the rendering quality.
[1] Li, Tianye, et al. "Neural 3d video synthesis from multi-view video." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
We thank four reviewers for their constructive comments on how to improve our paper. We will provide individual responses below. The qualitative results on the real-world Neu3D dataset, the qualitative comparison to SC-GS, and a simple video exhibition can be found in the following PDF.
This submission received all positive final ratings after a heated discussion and a strong rebuttal, and the AC recommends acceptance. Please carefully address the concerns raised by the reviewers in the camera-ready version, especially regarding the new results on Neu3D, using the average performance in Table 7, including the references suggested by the reviewers, and the video results. In addition, the AC suggests the authors include and discuss NeRFPlayer in the revision, which also leverages a HashGrid representation for dynamic field modeling.