8.2

/10

Spotlight4 位审稿人

最低5最高5标准差0.0

4.0

置信度

创新性3.3

质量3.3

清晰度3.3

重要性3.0

NeurIPS 2025

4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos

Zhen Xu,Zhengqin Li,Zhao Dong,Xiaowei Zhou,Richard Newcombe,Zhaoyang Lv

OpenReview PDF

提交: 2025-05-08更新: 2025-10-29

TL;DR

We propose a scalable 4D dynamic reconstruction model trained only on real-world monocular RGB videos.

摘要

关键词

Image-Based RenderingLarge Reconstruction Model3DGS4DGSSelf-Supervised TrainingMonocular Reconstruction

评审与讨论

审稿意见

评分: 5置信度: 42025-06-11

This paper introduces 4DGT, a large-scale 4D reconstruction model predicting 4D Gaussian from real-world monocular videos. Comparing to optimization-based method, this feed-forward model significantly reduces the inference time, while achieving great reconstruction quality. It proposes several techniques to facilitate training, such two stage training, pixel-level Gaussian pruning, multi-level attentions. It outperforms other feed-forward reconstruction models with a large margin.

优缺点分析

Strengths

The first feed-forward 4DGS reconstruction model learning from posed monocular videos. Since the input data is easy to acquire, It shows great scale-up ability.
Density control strategy (pruning & densification) to improve efficiency and sampling rates. Comparing to GS-LRM, this model is more efficient.
The results are strong, the rendered appearance, normal and depth are much better than the baseline.

Weaknesses

The model still have difficulty modeling fine-grained motion, e.g. the finger is blur in demo videos.
SoM with 3DGS leads to bad geometry quality, especially normal. I would expect a more comprehensive upper bound (e.g. optimization-based dynamic 2DGS) with the same geometry supervision as 4DGT.

问题

Eq(7) seems problematic, the subscript $i$ and $j$ still exist after summing up. It should be $h_k$ .
Continue from the previous question, it is unclear how $S$ is selected from $h_k$ . And since $k \in {1, 2,..., p^2}$ and $h_k$ is patch-level independent, does it mean you prune with the same pattern for all patches? Authors should discuss the advantage and disadvantage of this design choice.
How this model performs in outdoor scenes, especially with infinitely far-away sky?

局限性

Yes

最终评判理由

Authors have addressed my concern. I am happy with accepting this paper.

格式问题

作者回复

2025-07-29

We thank the reviewer for their insightful and valuable comments and their recognition of our work. We are greatly motivated that the reviewer finds our method performing well both in terms of quality and efficiency and acknowledges the scalability of our pipeline. We provide detailed responses to each of the reviewer's comments and questions in the following.

More comprehensive upper-bound. In Table 1, we present the improved upper-bound optimization-based method. SoM-2DGS-Geometry [1] augments SoM [1] with 2DGS [2] and uses the same normal regularization as ours. With the stronger geometry supervision, we do observe an increase in the normal quality in this improved upper-bound method. It's worth noting that our feed-forward prediction method still achieves comparable results against SoM-2DGS-Geometry [1], while being 3 orders of magnitude faster. After finetuning the feed-forward prediction for only 10 seconds (100 iterations, 150 ms per frame), we can achieve even better results. We will add these experiments to the revised paper.

Table 1: Comparison with a more comprehensive upper-bound on the ADT [3] dataset. SoM-2DGS-Geometry [1] denotes switching to 2DGS [2] for SoM [1] and adding the same geometry regularization as 4DGT. $\mathit{Ours}_{\mathit{tune10s}}$ denotes finetuning the output of the 4DGT model for 10 seconds (100 iterations).

Method	PSNR $\uparrow$	LPIPS $\downarrow$	RMSE $\downarrow$	Degree $\downarrow$	Recon. Time $\downarrow$
SoM [1]	$30.30 ^ {\pm1.64}$	$0.242 ^ {\pm0.024}$	$4.16^{\pm3.00}$	$34.07 ^ {\pm3.61}$	$60000$ ms/f
SoM-2DGS-Geometry [1]	$28.40^{\pm 2.60}$	$0.281^{\pm 0.027}$	$4.18^{\pm 2.98}$	$\mathbf{18.46^{\pm 1.76}}$	$60000$ ms/f
Ours	$28.31^{\pm 1.51}$	$0.243^{\pm 0.021}$	$0.93^{\pm 0.46}$	$25.94^{\pm 1.84}$	$\mathbf{25}$ ms/f
$\mathbf{Ours}_{\mathbf{tune10s}}$	$\mathbf{31.98^{\pm 1.01}}$	$\mathbf{0.220^{\pm 0.012}}$	$\mathbf{0.85^{\pm 0.45}}$	$19.25^{\pm 2.65}$	$25 + 150$ ms/f

Pruning pattern choice and Eq. (7). We thank the reviewer for pointing out the issue in the subscripts of Eq. (7). Indeed, after performing summation across all patches, the subscript $i,j$ should be discarded. We provide a corrected version of Eq. (7) as follows:

\mathbf{H} = \sum_{i=1,j=1}^{N,HW/p^2} \mathbf{M}_{i,j},

where $\mathbf{M}_{i,j} \in \mathbb{N}^{p^2}$ is the activation mask for patch $i,j$ , $\mathbf{H} \in \mathbb{N}^{p^2}$ is the histogram of all activation masks and $p$ is the patch size.

After computing the histogram $\mathbf{H} \in \mathbb{N}^{p^2}$ once after the first stage training, we select its top $S$ values to serve as the pruning pattern for the second stage training. This indeed indicates that the pruning pattern is shared across all patches. The reason behind this design choice is twofold:

We empirically find that in the $p^2$ pixels of a patch, the model tends to favor similar pixels to be activated and later used in the Gaussian rendering process across all patches, as shown in Figure 1 of the appendix (zoom in on the predicted opacity for a better visualization). After aggregating the histogram across all patches, a clear pattern of activation emerges, as shown on the right side of Figure 1 of the appendix. Moreover, in Table 2, we conduct quantitative experiments to compare against computing the pruning pattern for each patch separately on the fly and find that using a shared pattern achieves on-par performance.
With a shared pruning pattern, the pruning operator can be efficiently implemented as simply discarding the unused rows in the weight matrix of the last fully-connected layer of the decoder heads, which is much less demanding both in terms of memory usage and computation time than re-sorting the opacity values on the fly for each patch.

We will add these expanded discussions and results to the revised paper and fix the exposition issue of Eq. (7) to make this process clearer.

Table 2: Quantitative comparison between on-the-fly re-computation of pruning pattern for each patch and our shared pattern approach on the ADT [3] dataset. Shared (Ours: Table 2(b) +D&P) denotes our shared pattern pruning approach, corresponding to the +D&P row of Table 2(b) of the main paper.

Method	PSNR $\uparrow$	LPIPS $\downarrow$	RMSE $\downarrow$	Degree $\downarrow$	Mem. & Speed Overhead $\downarrow$
On the fly Recalc.	$28.36$	$\mathbf{0.241}$	$0.78$	$25.95$	$\mathrm{Yes}$
Shared (Ours: Table 2(b) +D&P)	$\mathbf{28.79}$	$0.242$	$\mathbf{0.72}$	$\mathbf{25.84}$	$\mathbf{No}$

Ourdoor scenes. Our method does not rely on the assumption of the scene being indoor since we train on such outdoor data from the Ego-Exo4D dataset [4]. That said, since we use the same depth activation (Sigmoid with near-far planes) as GS-LRM [5], it currently cannot handle infinite depth and we leave it as future work to incorporate a disparity-based depth prediction scheme for handling such cases. We will add these visualizations and discussions in the revised paper.

Quality. Due to the ill-posed nature of the monocular reconstruction problem (i.e. limited view and FPS), we might still observe some blurriness for sudden and very fast movements, similar to other monocular reconstruction methods. We leave it as future work to incorporate larger-scale datasets and more expert guidance for the 4DGT model to further mitigate this. We will add these discussions to the revised paper. As the reviewer noted, our method already achieves much sharper results compared to the state-of-the-art baselines.

[1] Shape of Motion: 4D reconstruction from a single video, arXiv 2024
[2] 2d gaussian splatting for geometrically accurate radiance fields, SIGGRAPH 2024
[3] Aria digital twin: A new benchmark dataset for egocentric 3d machine perception, ICCV 2023
[4] Ego-Exo4D: understanding skilled human activity from first- and third-person perspectives, CVPR 2024
[5] Gs-lrm: Large reconstruction model for 3d gaussian splatting, ECCV 2024

2025-08-04

How finetuning is applied in Table 1?

评论- Details of the Ours-tune10s variant

2025-08-04

Thanks for pointing out the confusion! For the $\mathbf{Ours}_{\mathbf{tune10s}}$ variant. we simply optimize the output of the 4DGT model directly (i.e. a set of dynamic Gaussians as defined in the "Dynamic Gaussians" paragraph of Section 3.1 of the main paper) on the input images of the test sequence for 10 seconds. This can also be viewed as using the output of 4DGT for an initialization of a per-scene optimization method based on our dynamic Gaussian representation.

2025-08-05

Thanks for your effort during rebuttal! I will keep positive rating as accept.

评论- Thanks for the prompt acknowledgment

2025-08-08

We thank the reviewer for their prompt acknowledgment and we are greatly motivated that the reviewer keeps their positive rating. We will make sure that the added discussions, clarifications and results are added to the final version of the paper.

评论- Authors' Response to your questions

2025-08-03

Dear Reviewer YKwU,

The authors have provided addition data in response to your questions. What is your view after seeing this additional information? It would be good if you could actively engage in discussions with the authors during the discussion phase ASAP, which ends on EoA (Aug 6).

Best, AC

审稿意见

评分: 5置信度: 32025-06-22

This paper proposes a scalable 4D dynamic reconstruction model that leverages 4D Gaussian Splatting (4DGS) to represent and render dynamic scenes. The model can explicitly disentangle dynamic content from the static background and generate realistic optical flow. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art approaches across multiple datasets.

优缺点分析

Strengths:

The proposed feedforward framework enables fast and scalable dynamic scene reconstruction.
The method achieves superior performance compared to existing state-of-the-art techniques on various benchmark datasets.
The experimental results are comprehensive and compelling, providing strong empirical support for the method's effectiveness.

Weaknesses:

The reconstruction of dynamic objects still exhibits certain artifacts; slight camera motion shows the geometric inconsistencies.
Some reconstructed results appear overly smoothed or blurred, potentially compromising fine details.

问题

Although the method surpasses existing state-of-the-art techniques, there remains a gap between the presented results and practical deployment. So, could the outputs of this method serve as a strong initialization for optimization-based reconstruction approaches?
Some results exhibit noticeable over-smoothing, such as the third column in Figure 3 where the hand appears less detailed compared to SoM. Could the authors provide an explanation for this phenomenon?
Would it be possible to include a comparison with optimization-based methods? Such a comparison would help readers better understand the relative strengths and limitations of the proposed approach.

局限性

Yes.

最终评判理由

The authors' rebuttal has addressed the majority of my concerns. I will maintain my original score.

格式问题

None

作者回复

2025-07-29

We thank the reviewer for their constructive feedback, insightful questions and their recognition of our contribution. We are deeply motivated that the reviewer finds our performance superior, our validation thorough and the proposed method effective. In the following, we address each of the reviewer's comments and questions in detail.

Initializing optimization-based method for improved quality. We thank the reviewer for the suggestion. In Table 1, we present the results of using the output of the 4DGT model as an initialization for a optimization-based baseline (denoted as $\mathit{Ours}_{\mathit{tune10s}}$ ), i.e. finetuning our proposed dynamic Gaussian representation on the input video for only 10 seconds (100 iterations), armortizing to 150ms per frame. We can already achieve better results compared to the optimization-based SoM [1], which requires 30000 iterations of iterative update, a complex tracking expert model (TAPIR [4]) and sophisticated densification strategies (3DGS [2]). Even with the finetuned model, we still achieve a 350 $\times$ speedup against SoM [1]. Note that further finetuning will yield even better reconstruction results. We will add these experiments to the revised paper.

Table 1: Quantitative results for initialization on the ADT dataset [3], compared to the optimization-based SoM [1]. $\mathbf{Ours}_{\mathbf{tune10s}}$ denotes finetuning the output of the 4DGT model for 10 seconds (100 iterations).

Method	PSNR $\uparrow$	LPIPS $\downarrow$	RMSE $\downarrow$	Degree $\downarrow$	Recon. Time $\downarrow$
SoM [1]	$30.30 ^ {\pm1.64}$	$0.242 ^ {\pm0.024}$	$4.16 ^ {\pm3.00}$	$34.07 ^ {\pm3.61}$	$60000$ ms/f
$\mathbf{Ours}_{\mathbf{tune10s}}$	$\mathbf{31.98 ^ {\pm1.01}}$	$\mathbf{0.220 ^ {\pm0.012}}$	$\mathbf{0.850 ^ {\pm0.452}}$	$\mathbf{19.25 ^ {\pm2.65}}$	$\mathbf{25 + 150}$ ms/f

Comparison with SoM [1] (comparison with optimization-based method). SoM [1] is a state-of-the-art optimization-based method for dynamic scene reconstruction from monocular videos. It performs per-scene optimization of a Gaussian-based dynamic motion representation with human-labeled motion masks and costly expert models regularization, especially dense all-frame-to-all-frame tracking. Our method achieves comparable results against SoM [1] while being 2400 $\times$ faster. Since SoM [1] performs a lengthy iterative optimization, it might produce sharper details on some areas of the video. Note that tuning the output of the 4DGT model for only 10s (100 iterations, 150 ms per frame) can already produce better results than SoM (30000 iterations), as shown in Table 1. We will expand the discussion in the revised paper to make the exposition clearer.

Quality. Similar to other monocular reconstruction methods, we still observe certain artifacts when viewing the reconstruction results from certain angles (mentioned in the limitation section). We leave it as future work to incorporate larger-scale datasets and more expert guidance for the 4DGT model to further mitigate this. It's worth noting that, as the reviewer pointed out, our method already produces much better results compared to the current state-of-the-art and also achieves comparable results against optimization-based methods. We will add these discussions to the revised paper.

[1] Shape of Motion: 4D reconstruction from a single video, arXiv 2024
[2] 3D Gaussian splatting for real-time radiance field rendering, SIGGRAPH 2023
[3] Aria digital twin: A new benchmark dataset for egocentric 3d machine perception, ICCV 2023
[4] TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement, ICCV 2023

2025-08-01

Thanks for the authors' response. It has addressed the majority of my concerns. I will maintain my original score.

评论- Thanks for the prompt reply

2025-08-04

We thank the reviewer for their prompt reply and their acknowledgment to our response. We are deeply encouraged that the reviewer finds our response informative and we will include the added results in the final paper.

审稿意见

评分: 5置信度: 42025-06-28

This paper proposes a feed-forward model for dynamic Gaussian reconstruction from monocular videos. To model the dynamic Gaussians, this paper introduces additional four Gaussian parameters to represent the motion of Gaussians. The overall model is a Transformer-based architecture to predict dynamic Gaussian parameters from posed images. To make the training more efficient and scalable, this paper proposes a two-stage training strategy and a density control method to remove redundant Gaussians. Experiments are conducted on several dynamic scene datasets and the proposed method achieves state-of-the-art results.

优缺点分析

Strengths

The overall idea is simple and the pipeline is clear, and the proposed method achieves state-of-the-art results compared to previous methods.
The proposed density control method effectively reduce the number of Gaussians, which also makes the proposed model more scalable.
Experiments are extensive, with comparisons to several different baselines and evaluations on several different datasets.

Weaknesses

The choice of the dynamic Gaussian representation (L122-137) doesn't seem to be fully justified. This paper introduces four additional Gaussian attributes to model the Gaussian's motion across time. However, it's not fully clear why this paper chooses to use such a representation. In particular, I wonder if this representation is firstly proposed in this paper, or it's based on previous methods (seems that I couldn't find the related references in the paper). I am also curious about how this representation would compare to 3D flow vector based motion representation or the rigid motion representation as proposed in Shape of Motion paper. I think currently the description of the dynamic Gaussians representation lacks a bit of intuition, it would be better if the authors could clarify why this paper chooses to use this representation. I am also not fully clear whether a careful initialization is needed for the four additional Gaussian attributes.
I think it would be better if the authors could show how the Gaussians evolve over time, this would also help readers better understand how the proposed motion representation works in practice.
It's not very clear if the model relies on metric-scale camera calibrations. According to the training datasets, this paper uses both metric-scale camera annotations and COLMAP camera parameters, which might not be metric scale. However, in the limitation, this paper mentions that "We observe that the quality may degrade in videos recorded by an unseen type of device due to the inaccurate metric scale calibrations". A deeper discussion on why this happens would be beneficial.
I am a bit struggling at understanding which component in the proposed method contributes most to the final performance. From Table 2(a), it seems the loss functions are pretty important. However, a breakdown of different loss terms would be beneficial to gain more insights on the proposed method.

问题

The key questions are listed in the Weaknesses. In addition,

Could the authors report the exact numbers of Gaussians predicted by the model? For example, with 64 and 128 input frames. This would help understand the compactness of the learned 3D representation.
In L55-57, this paper mentions that "Compared to prior methods that also train dynamic Gaussian prediction [41, 24], 4DGT generalizes better to complex real-world dynamics and also exhibits emergent motion properties such as motion segmentation." However, it seems that I couldn't find a comparison with [24] in the experiments. Could the authors elaborate more on how 4DGT generalizes better than [24]?

局限性

Yes

最终评判理由

The rebuttal has adequately addressed my questions, and thus I vote for acceptance.

格式问题

作者回复

2025-07-29

We thank the reviewer for their insightful feedback and their recognition of our paper. We are particularly encouraged that the reviewer finds our method effective and our evaluations thorough. In the following, we prepare detailed responses to each of the reviewer's comments and questions.

Choice of dynamic Gaussian representation. We thank the reviewer for pointing out this confusion. The main purpose for our choice of dynamic Gaussian representation is to enable seamless integration to a feed-forward prediction pipeline, which can be used for self-supervised training on general dynamic videos. Compared to per-frame 3DGS, static 3DGS, flow vector field or decomposed motion bases, we found the explicit modeling of the motion terms of dynamic Gaussians (adapted from FTGS [2], originally proposed in 4DGS [1] (L124, L129)) to be a better fit for this purpose.

Compared to a per-frame 3DGS [5] representation (denoted as the per-frame variant in the ablation studies of the paper), our representation enables the integration of space-time information as a 4D Gaussian with a non-zero life-span automatically encodes information across multiple frames, making it possible to train the 4DGT model in a self-supervised manner on monocular videos. In the most extreme case with infinite life-span, the representation is reduced to a purely static 3DGS (denoted as the Static-LRM baseline in the paper) and would only work on static scenes.
Compared to per-frame 3DGS and static 3DGS, our representation can freely encode the different levels of motion speed (from $0$ to $\infty$ ) of the dynamic scene. Comparison against the StaticLRM baseline and per-frame 3DGS variant can be found in Table 2(a) of the main paper.
Compared to a 3D flow vector field representation, like DynamicGaussians [6], our representation can be easily integrated into the pixel-aligned feed-forward prediction pipeline for patch-based vision transformers. However, their flow vector field, which is typically encoded by an MLP, would be much harder to predict in a feed-forward manner.
Similar problems exist for the rigid motion representation used in Shape-of-Motion [7], since it's extremely ill-posed to accurately predict their motion bases and coefficients without complicated initialization.
In comparison, our representation does not require such careful initialization. In practice, we simply initialize all $\mathbf{v}, \boldsymbol{\omega}$ to zero, $\mathbf{t}$ to the timestamp of the corresponding frame and $\mathbf{l}$ to a large value (50s).
Compared to other implicit network-based methods like NeRF [3] (as used in Pred. 3D Repr. [4]), a Gaussian-based representation would enable much more efficient rendering and training.

We will add these discussions to the revised paper to better motivate the design choices of our dynamic Gaussian representation. We will also revise the reference to make the exposition clearer. We will also add the details about initialization and add a visualization of the dynamic Gaussians' movement to show their evolution.

Metric-scale cameras. We thank the reviewer for pointing out this confusion. We empirically find the model works best when trained and inferred with metric-scale cameras due to the ambiguity in depth-scale estimation. Notably, the model doesn't rely on fully accurate scaling to perform well, as shown by experiments on the COP3D and EPIC-FIELDS datasets [8,9], showing the ability to handle slight deviation from metric-scale calibrations. By introducing such non-metric datasets in training, we force the model to reason from the relative relation of the input cameras and the input images. However, the model would fail to predict coherent results when there exists an order-of-magnitude scale error. We will add these discussions to the limitation section and add more visualization on the COP3D and EPIC-FIELDS datasets to make the exposition clearer.

Breakdown of different loss terms. We thank the reviewer for pointing this out. The regularization terms can be categorized into two groups:

$\mathcal{L}_{\mathbf{v},\boldsymbol{\omega},\mathbf{l}}$ regularizes the motion of the dynamic Gaussian predictions. Without this term, although the quantitative results are not greatly affected, the model would fall into the trivial local minima of making every Gaussian transient and dynamic, not correctly modeling the scene's static or slow-moving parts. This would result in a purely-white motion mask. In Table 1, we evaluate the quality of the motion mask of our method without this term.
$\mathcal{L}_{\mathbf{N},\mathbf{D}}$ provides expert guidance for the geometry of the dynamic Gaussian prediction. As shown in Figure 2 of the appendix and Table 2(a) of the paper, these terms can greatly improve the quality of the reconstructed geometry.

We will add these expanded ablation studies and discussion to the revised paper to make the exposition clearer.

Table 1: Motion segmentation results on the ADT dataset [10].

Method	w/o $\mathcal{L}_{\mathbf{v},\boldsymbol{\omega},\mathbf{l}}$	Ours
mIoU $\times$ 100 $\uparrow$	$9.4^{\pm4.1}$	$\mathbf{81.2^{\pm1.8}}$

Exact number of Gaussians. The number of dynamic Gaussians predicted by the first stage is pixel-aligned. Thus, it can be computed by (derived from Eq. (4) of the paper):

N_{\mathbf{g}} = N \times H \times W.

For the resolution of 252 $\times$ 252 and 16 images (half spatial resolution and 1/4 $\times$ temporal resolution of the second stage (L193 and L252 of the paper)), this corresponds to 508032 (0.5m) Gaussians. For the second stage, the resolution is increased to 504 $\times$ 504 and 64 images (L177-179, L195 and L252 of the paper). With the proposed patch-based pruning strategy, the number of Gaussians can be computed by (derived from Eq (4) and L168-169 of the paper):

N_{\mathbf{g}} = N \times H \times W \times \frac{S}{p^2},

resulting in a total of 829440 Gaussians. Finally, the proposed multi-level spatial attention mechanism introduces two additional downsampled outputs, with 1/4 $\times$ and 1/16 $\times$ the spatial resolution, respectively, leading to a final GS count of 829440 * (1 + 1/4 + 1/16) = 1088640 (1m) for the second stage.
Thanks to our proposed selected activation pruning strategy, the number of Gaussians only increases by 1 $\times$ while the space-time resolution increases 15 $\times$ . We will add these analyses to the revised paper to illustrate the compactness of the proposed model.

Comparison with BTimer (cited as [24]). We thank the reviewer for pointing out this confusion. Since BTimer [11] is not open-sourced, we do not make direct experimental comparisons with it. Compared to BTimer [11], which adopts a per-frame Gaussian prediction pipeline for the bullet-time frame only, our method produces explicit dynamic Gaussians thus can model explicit motion, showing emergent capabilities like motion segmentation. We will update the description in the introduction section (L55-57) to fix the exposition issue:

Revised description: "compared to prior dynamic Gaussian methods that only train on synthetic object-level data [41], 4DGT generalizes better to complex real-world dynamics. Compared to the per-frame prediction pipeline [24], 4DGT also exhibits emergent motion properties such as motion segmentation."

[1] Real-time photorealistic dynamic scene representation and rendering with 4D gaussian splatting, ICLR 2024
[2] FreeTimeGS: Free gaussian primitives at anytime and anywhere for dynamic scene reconstruction, CVPR 2025
[3] Nerf: Representing scenes as neural radiance fields for view synthesis, ECCV 2020
[4] Predicting 3D representations for dynamic scenes, arXiv 2025
[5] 3D Gaussian splatting for real-time radiance field rendering, SIGGRAPH 2023
[6] 4D gaussian splatting for real-time dynamic scene rendering, CVPR 2024
[7] Shape of Motion: 4D reconstruction from a single video, arXiv 2024
[8] Common pets in 3d: Dynamic new-view synthesis of real-life deformable categories, CVPR 2023
[9] Epic Fields: Marrying 3d geometry and video understanding, NeurIPS 2023
[10] Aria digital twin: A new benchmark dataset for egocentric 3d machine perception, ICCV 2023
[11] Feed-forward bullet-time reconstruction of dynamic scenes from monocular videos, arXiv 2024

2025-08-04

I thank the authors for the rebuttal, which adequately addressed my previous questions. Thus, I raise my score. Please incoporate these clarifications and promised visualizations in the revised version of the paper.

评论- Thanks for the prompt feedback

2025-08-04

We sincerely thank the reviewer for the prompt feedback and we’d like to once again express our gratitude towards their constructive review. We are deeply motivated that the reviewer will raise their score. We will be sure to incorporate all the added discussions and add the movement visualization of the Gaussians in the revised paper.

评论- Authors' Response

2025-08-03

Dear Reviewer aM6n,

Best, AC

审稿意见

评分: 5置信度: 52025-06-29

This paper introduces 4D Gaussian Transformer (4DGT), a feed-forward model for fast, high-quality 4D dynamic scene reconstruction from monocular videos. The key technical contribution is a novel density control strategy for training the model. This two-stage approach of pruning and densifying the predicted 4D Gaussians is presented as a crucial innovation, which the authors claim enables the 4DGT model to handle longer space-time input and remain efficient for rendering at runtime. Trained on real-world videos with pseudo-supervision, 4DGT achieves reconstruction quality competitive with slow optimization-based methods while being orders of magnitude faster.

优缺点分析

Strengths

Addresses a Valuable and Challenging Task: The paper tackles a highly relevant and difficult problem in computer vision: fast 4D scene reconstruction from a monocular video. This is a valuable goal with strong potential for real-world applications like AR/VR and robotics.
Strong Experimental Results: The method demonstrates strong performance based on experimental evidence. It achieves rendering quality (PSNR, LPIPS) that is competitive with state-of-the-art optimization methods while being significantly faster. It clearly outperforms existing learning-based approaches on real-world datasets, validating the effectiveness of the proposed architecture and training strategy.
Clarity and Presentation: The paper is exceptionally well-written and clearly organized. The authors effectively motivate the problem, and the proposed 4DGT architecture and novel two-stage training pipeline are explained logically and intuitively, making the complex methodology accessible to readers.

Weaknesses

While the paper presents a significant step forward, its claims could be further substantiated and its contributions clarified with additional analysis and experiments. My main suggestions for improvement are as follows:

Need for More Detailed Ablation Studies. The current ablation study on the loss function (Table 2a) removes all regularization and guidance terms at once. The paper could be strengthened by a more granular analysis.
Quantifying Emergent Semantic Properties. The paper makes a fascinating claim about the model learning "emergent motion properties such as segmentation and flow" (line 40). This is a potentially significant finding. To make this claim more robust, it would be highly beneficial to include quantitative and qualitative results.
Temporal Inconsistency. Dynamic objects in the rendered videos exhibit noticeable flickering, highlighting a lack of temporal consistency. The paper's "Expert guidance" strategy successfully leverages pre-trained depth and normal models. A natural question arises: why not also incorporate guidance from a pre-trained optical flow expert? This would be a logical extension of the existing framework and a direct way to supervise the motion model to reduce these flickering artifacts.
Baseline Comparisons: While the comparison to L4GM and SoM is strong and appropriate, the paper is missing comparisons to other relevant feed-forward methods likes:

[1] Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos[J]. arXiv preprint arXiv:2412.03526, 2024.

[2] Predicting 3D representations for Dynamic Scenes[J]. arXiv preprint arXiv:2501.16617, 2025.

问题

See the weakness section.

局限性

yes

最终评判理由

My initial review praised the paper's novel approach to 4D reconstruction but raised concerns about the substantiation of its claims, particularly regarding emergent properties, and the need for more thorough ablations and baseline comparisons. After carefully considering the author's rebuttal and the subsequent discussions, I have raised my score and recommend acceptance.

格式问题

The paper appears to be well-formatted and I did not notice any major violations of the NeurIPS 2025 formatting instructions.

作者回复

2025-07-29

We thank the reviewer for their constructive feedback and their recognition of our work. We are especially encouraged that the reviewer finds our method intuitive and well-motivated, our evaluation strong, our results showing and our writing clear and accessible. In the following, we address each of the reviewer's comments and questions in detail.

More detailed ablation study. We thank the reviewer for pointing this out. The regularization terms can be categorized into two groups:

$\mathcal{L}_{\mathbf{v},\boldsymbol{\omega},\mathbf{l}}$ regularizes the motion of the dynamic Gaussian predictions. Without this term, although the quantitative results are not greatly affected, the model would fall into the trivial local minima of making every Gaussian transient and dynamic, not correctly modeling the scene's static or slow-moving parts. This would result in a purely-white motion mask. In Table 1, we evaluate the quality of the motion mask of our method without this term.
$\mathcal{L}_{\mathbf{N},\mathbf{D}}$ provides expert guidance for the geometry of the dynamic Gaussian prediction. As shown in Figure 2 of the appendix, these terms can greatly improve the quality of the reconstructed geometry.

We will add these expanded ablation studies and discussion to the revised paper to make the exposition clearer.

Table 1: Motion segmentation results on the ADT dataset [1].

Method	w/o $\mathcal{L}_{\mathbf{v},\boldsymbol{\omega},\mathbf{l}}$	MegaSaM [2]	Ours
mIoU * 100 $\uparrow$	$9.4^{\pm4.1}$	$77.4^{\pm4.0}$	$\mathbf{81.2^{\pm1.8}}$

Quantifying emergent properties. To quantify and evaluate the emergent motion properties of the 4DGT model, we compare the quality of the estimated motion mask, using the ground truth dynamic object segmentation of the ADT dataset [1] in Table 1. Thanks to the explicit modeling of the motion parameters in our representation, our model can produce comparable motion segmentation against MegaSaM [2], which has explicit flow supervision, while being 200 $\times$ faster (1.5s v.s. 300s). We will include these quantitative and qualitative results in the revised paper.

Guidance from optical flow experts and consistency. Although our model can handle moderately fast motion of the dynamic scene, it might produce some flickering during sudden movements of the dynamic part due to the ill-posed nature of monocular reconstruction (i.e. limited view and FPS). It's worth noting that, as the reviewer pointed out, our method already produces much better results compared to the current state-of-the-art feed-forward methods and also achieves comparable results against optimization-based methods.

Indeed, one natural way to improve its consistency is to incorporate a flow expert's guidance. It seems an optical flow expert like RAFT [3] can easily be plugged into our pipeline but it's non-trivial in practice. There are two reasons we did not adopt such a flow model for guidance:

In our preliminary experiments, we discovered that the estimated optical flow exhibits strong inconsistency, often reaching more than 10 pixels in cycle consistency errors. This in turn makes the training of the 4DGT model unstable and leads to NaN values in the prediction.
A tracking expert model like TAPIR [4] would produce much more consistent results for guiding the prediction of dynamic Gaussians, as shown by Shape-of-Motion [5]. However, the computation of such dense all-to-all tracking is extremely time-consuming (a few hours for a 128-frame clip), making it impractical for our large-scale training setup (1000 hours of video data).

Due to these reasons, we leave the addition of the flow expert model's guidance to improve the temporal consistency to future work. We will add these discussions to the revised paper and expand the limitation and future work section for this possible extension.

Baseline comparisons. We thank the reviewer for pointing out these relevant feed-forward baselines. However, since both BTimer [6] and Pred. 3D Repr. [7] are not open-sourced, we do not make direct experimental comparisons with them. Compared to B-Timer [6], which adopts a per-frame Gaussian prediction pipeline, our method produces explicit dynamic Gaussians thus can model explicit motion, showing emergent capabilities like motion segmentation. Compared to Pred. 3D Repr. [7], which adopts a tri-plane based implicit representation, our Gaussian model enables fast rendering after the feed-forward reconstructrion. We will add these discussions to the revised paper. We will also add the missing reference to Pred. 3D Repr. [7] to the related work section, where we discuss the other feed-forward baselines (L95-L100).

[1] Aria digital twin: A new benchmark dataset for egocentric 3d machine perception, ICCV 2023
[2] MegaSaM: Accurate, fast, and robust structure and motion from casual dynamic videos, CVPR 2025
[3] RAFT: Recurrent all-pairs field transforms for optical flow, ECCV 2020
[4] TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement, ICCV 2023
[5] Shape of Motion: 4D reconstruction from a single video, arXiv 2024
[6] Feed-forward bullet-time reconstruction of dynamic scenes from monocular videos, arXiv 2024
[7] Predicting 3D representations for dynamic scenes, arXiv 2025

2025-08-04

Thank you for the strong rebuttal. The new quantitative results are convincing and have fully addressed my concerns. I will raise my rating accordingly.

评论- Thanks for the prompt feedback

2025-08-04

We thank the reviewer for their prompt reply and acknowledgment of our response and we are greatly encouraged that the reviewer will raise their rating. We will include the added results and discussions in the final paper.

评论- Authors' Response

2025-08-03

Dear Reviewer koHi,

Best, AC

评论- Discussion with Authors

2025-08-02

Dear Reviewers,

The discussion period with the authors has now started. It will last until Aug 6th AoE. The authors have provided responses to your questions. I request that you please read the authors' responses, acknowledge that you have read them and start discussions with the authors RIGHT AWAY if you have further questions, to ensure that the authors enough time to respond to you during the discussion period.

Best, AC

最终决定Accept (spotlight)

2025-09-17

This paper proposes a novel feedfoward approach for reconstructing 4D Gaussians. Four expert reviewers provided final ratings of 4 x accept. They appreciated the work for addressing a novel task, its strong results and for providing extensive and thorough experimental evaluation. The reviewers' questions were addressed during the author-reviewers discussion phase. The AC concurs with the reviewers' consensus and recommends acceptance. Congratulations! The authors should make the changes to the final manuscript that they have promised.