PaperHub
Score: 7.3 / 10 · Poster · 4 reviewers
Ratings: 5, 4, 5, 4 (min 4, max 5, std 0.5)
Confidence: 4.0
Novelty: 3.3 · Quality: 2.8 · Clarity: 2.5 · Significance: 3.0
NeurIPS 2025

DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos

Submitted: 2025-04-30 · Updated: 2025-10-29
TL;DR

The first feed-forward reconstruction model for scene-level deformable 3DGS from a monocular video

Abstract

Keywords
Large reconstruction model, LRM, dynamic

Reviews and Discussion

Review
Rating: 5

The paper proposes DGS-LRM, a large generalizable reconstruction model that, given a posed monocular video, predicts a 4D reconstruction in the form of deformable 3D Gaussians in a single feed-forward pass. After applying both spatial and temporal tokenization to the input video and concatenating Plücker ray and timestep encodings, a standard transformer architecture followed by a two-layer MLP predicts the parameters of the deformable 3D Gaussians. These are parameterized by pixel-wise depth and the remaining Gaussian attributes (RGB, rotation, scale, opacity) for a single Gaussian along each ray of the (temporally downsampled) input and additional reference frames. Additionally, the network predicts pixel-aligned 3D deformation (translation) vectors for all timestamps that can be used to warp the Gaussians predicted for one frame to the timestamps of all other frames. The 3D representation at a certain timestamp is therefore the union of all Gaussians warped to this particular time. For supervision, the authors employ standard reconstruction losses on rendered images (MSE, LPIPS) as well as L1 losses to directly supervise the pixel-wise depth and deformation predictions with ground-truth depth and scene flow. Since these supervision signals are usually not available for real-world datasets, the authors create a synthetic dataset generated with the Kubric engine. Despite training only on this synthetic data, the resulting model generalizes well to real-world videos. Since the model is only trained on short clips, the paper further proposes to chain the scene flow for longer-horizon 3D point tracking. In an evaluation on the real-world datasets DyCheck and DAVIS, the proposed method outperforms the feed-forward baseline L4GM and falls only slightly behind the state-of-the-art optimization-based 4D reconstruction approach while enabling real-time inference. The authors further show that they achieve performance similar to the state of the art in feed-forward 3D point tracking. The paper also contains ablation studies w.r.t. design choices and loss functions.
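To make the pipeline described above concrete, the following is a minimal PyTorch sketch of a feed-forward model of this kind: the video (with Plücker ray and timestep channels concatenated) is split into spatio-temporal patches, passed through a standard transformer, and decoded by a small MLP into per-pixel depth, Gaussian attributes, and per-timestep 3D translations. All module names, channel counts, and layer sizes (ToyDGSLRM, D_MODEL, the patch sizes, the 12-parameter attribute split) are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch (PyTorch) of the feed-forward pipeline the review describes;
# sizes and names are placeholders, not the paper's settings.
import torch
import torch.nn as nn

T, H, W = 8, 64, 64          # input frames and resolution (toy values)
PATCH_T, PATCH_S = 4, 8      # temporal / spatial patch sizes
D_MODEL = 256

class ToyDGSLRM(nn.Module):
    def __init__(self, n_out_timesteps: int):
        super().__init__()
        # 3 RGB + 6 Plücker ray + 1 timestep channel per pixel -> volumetric patches
        self.tokenize = nn.Conv3d(10, D_MODEL,
                                  kernel_size=(PATCH_T, PATCH_S, PATCH_S),
                                  stride=(PATCH_T, PATCH_S, PATCH_S))
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # per-token head decodes one Gaussian per pixel of the patch:
        # depth(1) + rgb(3) + rotation(4) + scale(3) + opacity(1) = 12 params,
        # plus a 3D translation ("scene flow") to every output timestep
        per_pixel = 12 + 3 * n_out_timesteps
        self.head = nn.Sequential(
            nn.Linear(D_MODEL, 512), nn.GELU(),
            nn.Linear(512, PATCH_T * PATCH_S * PATCH_S * per_pixel))

    def forward(self, video_with_embeddings):                   # (B, 10, T, H, W)
        tokens = self.tokenize(video_with_embeddings)           # (B, D, T', H', W')
        tokens = tokens.flatten(2).transpose(1, 2)               # (B, N, D)
        feats = self.backbone(tokens)
        return self.head(feats)                                  # raw Gaussian params

model = ToyDGSLRM(n_out_timesteps=T)
out = model(torch.randn(1, 10, T, H, W))
print(out.shape)   # (1, num_tokens, params for all pixels of each volumetric patch)
```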

Strengths and Weaknesses

Strengths:

  • The method is simple, intuitive, and elegant:
    • The explicitly predicted deformation vectors allow for straightforward 3D point tracking.
    • The model is trained on synthetic data only:
      • This allows for the use of additional supervision signals like ground-truth depth and scene flow.
      • Still, the trained model generalizes well to real-world videos, which is a strong result.
  • The evaluation shows strong quantitative results compared to feed-forward baselines:
    • It achieves competitive performance in 3D point tracking compared to the state of the art.
    • It outperforms L4GM significantly in 4D reconstruction.
  • The ablation studies validate the effectiveness of the design choices (dual-view sampling, scene flow loss, reference frames).
  • The paper is mostly well-written and easy to follow.
  • The supplementary material contains qualitative video results for 4D reconstruction and 3D point tracking.

Weaknesses:

  • Main lack of clarity: The paper does not explain its technical contributions well.
    • The dual-view sampling is not easy to understand. I understood it as simply supervising with always 2 target views during training, but in line 258, the authors mention that they use 8 output views per scene during training.
    • The view selection (lines 177 ff.) is also not completely clear from the text. A visualization similar to Fig. 3 could be beneficial.
    • The role of the reference frames is unclear. They are introduced in the context of reducing GPU VRAM utilization, but it remains unclear how they contribute to that, as more views/frames should mean even larger memory utilization in the self-attention between all patches.
    • The flow chaining procedure (Section 3.4) is not easy to follow. A visualization could be helpful.
  • Further lack of clarity:
    • If I understood lines 145 ff. correctly, the last frame within a temporal downsampling chunk is used for pixel-aligned Gaussians. Is that a problem for flow chaining, as the last (temporally downsampled) frame of the 1st video and the first frame of the 2nd video are not the same?
    • In Equation 2, should these be unions instead of sums?
    • Why do rotation & opacity deformations not help? Do you have any intuition?
  • The claim of comparable performance with optimization-based methods is questionable. The 4D reconstruction performance is significantly worse than that of PGDVS (Pseudo-Generalized Dynamic View Synthesis from a Video, ICLR 2024).
  • The claim that DGS-LRM is the first feed-forward method predicting deformable 3D Gaussian splats from monocular posed videos is questionable:
    • Why is L4GM (Large 4D Gaussian Reconstruction Model, NeurIPS 2024) not considered for this claim?
    • BTimer (Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos) appeared on arXiv in December 2024. Therefore, I would not consider it to be concurrent work anymore.
      • Are quantitative comparisons possible?
  • The ablation study is missing the effect of:
    • depth supervision and
    • scene normalization (lines 200 ff.) adopted from MegaSaM.
      • This one is especially important since the main baseline L4GM struggles to resolve the correct scene scale, as pointed out by the authors themselves in lines 284 f.
  • Fig. 5 seems to show input-view reconstructions, or at least novel views extremely close to the input views. Qualitative results for larger camera motions would be more interesting.
  • It is not trivial to see which point tracks are better in Fig. 6. Should it not be possible to show ground-truth tracks for comparison?

Questions

Please see the weaknesses for detailed questions. My main concern is the lack of clarity regarding the technical contributions, as well as the missing ablation studies and a possible comparison with BTimer. A convincing rebuttal w.r.t. these points is necessary for me to increase my evaluation score.

Additional Questions:

  • Is temporal downsampling a problem for the flow chaining?
  • Why do rotation & opacity deformations not help? Do you have any intuition?

Limitations

Yes

Final Justification

Overall, I have been positive regarding this paper already prior to the rebuttal. The main strengths are the novelty of the method, the generalization to real data while being trained on synthetic data only, and the strong experimental results w.r.t. both novel view synthesis and point tracking for a generalizable feed-forward and therefore fast method. My main concern prior to the rebuttal was the lack of clarity, which the authors largely resolved during the rebuttal. If the authors include the additional explanations in the revised paper version, I do not have any major concerns w.r.t. acceptance. I also appreciate the additional results provided and the pilot ablation studies conducted for the rebuttal.

As a result, I decided to raise my score to 5: Accept.

Formatting Issues

None

Author Response

We thank the reviewer for their insightful feedback and recognition of our work. We appreciate their noting that "the method is simple, intuitive, and elegant", that "the evaluation shows strong quantitative results", and that "the ablation studies validate the effectiveness of the design choices".

 

Issues of clarity We deeply appreciate your constructive and detailed feedback. We will fix these issues in a revised version of the paper.

  • The dual-view sampling: As mentioned in lines 182 to 185, we sample Q = 8 output images at Q/2 = 4 timesteps for ground-truth supervision. Specifically, at each timestep, we sample 2 images from 2 camera trajectories. This dual-view supervision is essentially a multi-view (two-view) supervision that enhances both geometry and novel view synthesis quality. In the revised version of the paper, we will strictly differentiate between the terms "timestep" and "frame."
  • View selection: For videos of static scenes, frames sampled at different timesteps provide sufficient multi-view supervision for novel view synthesis. However, for videos of dynamic scenes, simply using frames sampled at different timesteps can cause ambiguity, as both cameras and objects are moving. Therefore, we employ dual view supervision to sample 2 camera poses at the same timestep for rendering supervision.
  • Role of reference views: The role of reference views is to increase the input baseline, which is important for both geometry and view synthesis quality. We may achieve sufficiently large baselines if the camera moves enough across a long temporal span. However, naively increasing the input video length would add too many input frames, leading to prohibitively large GPU memory consumption. Therefore, we sample K = 4 extra temporally distant reference frames to help increase the baseline. A similar strategy was used in BTimer. We will further acknowledge this prior work in the revised version of the paper.
  • Flow chaining procedure: We agree that a visualization would be helpful, but we did not include it due to the page limit. We will add a better visualization in the revised version of the supplementary material. Our predicted per-Gaussian deformation field can deform Gaussians to every input timestep, not just the temporally downsampled key timesteps, and therefore will not cause issues in flow chaining (see the illustrative sketch after this list).
  • Union instead of sum: Thanks for the detailed comment! We will correct this in a revised version of the paper.
  • Opacity and rotation deformation: For rotation, the per-pixel deformable 3D Gaussians are densely distributed in the foreground area. Each Gaussian typically has a small covariance. Therefore, we empirically did not observe any impact from fixed rotation. For opacity, our pilot study indicates that allowing changes in opacity enables the model to "fake" dynamic appearances without accurately predicting the deformation field, which negatively impacts tracking accuracy. We will add this analysis to the revised version of our paper.
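As a complement to the flow-chaining explanation above, the following NumPy sketch illustrates one way such chaining could work: a query point is associated with a predicted Gaussian in the first clip, carried through that clip's per-timestep translations, and then re-associated with the next clip's Gaussians at the boundary. The nearest-neighbor association and the data layout are my own simplifying assumptions, not the paper's exact procedure.

```python
# Hypothetical flow-chaining sketch for long-horizon 3D point tracking.
import numpy as np

def track_point(query_xyz, clips):
    """clips: list of dicts with
         'base_xyz': (N, 3) Gaussian centers at the clip's first timestep
         'deform':   (T, N, 3) translation of each Gaussian to timestep t
       Returns the chained trajectory of the query point."""
    trajectory = []
    current = np.asarray(query_xyz, dtype=np.float64)
    for clip in clips:
        # associate the current 3D point with its nearest predicted Gaussian
        idx = np.argmin(np.linalg.norm(clip["base_xyz"] - current, axis=1))
        # follow that Gaussian's deformation through the clip
        for t in range(clip["deform"].shape[0]):
            trajectory.append(clip["base_xyz"][idx] + clip["deform"][t, idx])
        # hand over the last position as the query for the next clip
        current = trajectory[-1]
    return np.stack(trajectory)

# toy usage: two clips of 4 timesteps, 100 Gaussians each, small random motion
rng = np.random.default_rng(0)
clips = [{"base_xyz": rng.normal(size=(100, 3)),
          "deform": 0.01 * rng.normal(size=(4, 100, 3))} for _ in range(2)]
print(track_point([0.0, 0.0, 0.0], clips).shape)   # (8, 3)
```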

 

Comparisons with PGDVS While DGS-LRM performs better in some examples with complex motion, such as the pinwheel in Figure 4, we agree that PGDVS generally recovers more detailed textures, as indicated by the quantitative data. In the revised version of the paper, we will tone down the use of the word "comparable" and be more specific about the properties of each method. We would also like to emphasize that DGS-LRM is three orders of magnitude faster at inference compared to PGDVS, which requires several hours for inference.

 

First feed-forward method We emphasize that DGS-LRM is the first feed-forward method that predicts deformable 3D Gaussians, supporting both novel view synthesis and spatial tracking simultaneously. Neither BTimer nor L4GM predicts deformation fields. Regarding BTimer, we fully agree that it would be an informative comparison, but their code and trained model are not currently available. Based on NeurIPS policies, we will remove the word "concurrent" when describing BTimer.

 

Ablation studies

  • For depth supervision, our pilot study shows that our dual-view supervision already provides sufficient geometric constraints, and adding depth supervision does not significantly impact the reconstruction quality. We simply keep it since we have ground-truth depth data.
  • For scene normalization, both methods normalize the scene using the structure-from-motion depth for a fair comparison. This depth is used exclusively for scene normalization, to align the scene scale to the benchmark unit. We will include this implementation detail in the revised paper.

 

Qualitative figures While the camera baselines in Figure 5 are relatively small, we did present view synthesis results with larger baselines in Figure 4 and the supplementary material, where DGS-LRM demonstrates competitive reconstruction quality. For Figure 6, we agree that adding ground-truth flow would be useful, and we will include it in the revised version of the paper. Thank you again for the constructive feedback!

Comment

I thank the authors for their detailed rebuttal, which addressed most of my concerns. It really helped with the lack of clarity regarding the technical contributions. I appreciate that the authors will make sure to improve clarity in the revised paper version. I do not have any further questions requiring discussion with the authors.

Overall, I am still convinced and agree with other reviewers that the paper should be accepted. I will raise my score to Accept.

Comment

We sincerely thank the reviewer for their constructive and insightful feedback. We are committed to incorporating the suggested modifications in the final version of our paper.

Review
Rating: 4

The paper presents DGS-LRM, a novel feed-forward method for real-time dynamic scene reconstruction from posed monocular videos using deformable 3D Gaussian representations. It augments GS-LRM to handle dynamic reconstruction by predicting pixel-aligned Gaussian splats on keyframes, where each splat carries additional translation vectors that warp it to any other frame. To reduce the enormous memory demand, the image patches in GS-LRM are replaced with spatio-temporal volumetric patches. The model is trained with strong supervision, including ground-truth 3D scene flow. The authors created a new dataset with synchronized cameras and scene flow using Kubric.
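To illustrate the representation summarized above, the sketch below forms the scene at an arbitrary timestep as the union of all keyframes' pixel-aligned Gaussians, each shifted by its predicted translation to that timestep (my reading of the paper's Eq. (2)). The field names and array shapes are illustrative assumptions, not the paper's data structures.

```python
# Hypothetical "union of warped Gaussians" sketch for rendering at timestep t.
from dataclasses import dataclass
import numpy as np

@dataclass
class KeyframeSplats:
    centers: np.ndarray   # (N, 3) Gaussian centers at the keyframe
    attrs: np.ndarray     # (N, 11) rgb + rotation + scale + opacity (kept static)
    deform: np.ndarray    # (T, N, 3) translation of each Gaussian to timestep t

def gaussians_at_time(keyframes, t):
    """Union of all keyframes' Gaussians warped to timestep t."""
    centers = np.concatenate([kf.centers + kf.deform[t] for kf in keyframes])
    attrs = np.concatenate([kf.attrs for kf in keyframes])
    return centers, attrs   # ready to be splatted/rendered at time t

# toy usage: 4 keyframes, 16 timesteps, 1000 Gaussians each
rng = np.random.default_rng(1)
kfs = [KeyframeSplats(rng.normal(size=(1000, 3)),
                      rng.random(size=(1000, 11)),
                      0.01 * rng.normal(size=(16, 1000, 3))) for _ in range(4)]
centers, attrs = gaussians_at_time(kfs, t=5)
print(centers.shape, attrs.shape)   # (4000, 3) (4000, 11)
```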

Strengths and Weaknesses

Strengths

  • The paper tackles a very timely topic and a very challenging task.
  • It provided a new dataset and it looks like the trained model generalizes reasonably well to the test data.
  • Strong performance in 3D tracking.

Weaknesses

  • Overall, I find the methodology section quite hard to parse.
  • The proposed model appears to be the very definition of a "brute-force" attempt. Specifically,
    • all scene flow is directly supervised.
    • Eq. 2 suggests that the prediction at the target view fuses all input views' pixel-aligned GS into the target view through warping. It means the target view will be rendered with a significant amount of redundant GS. How does the model scale with high resolution? How many GS are we talking about? Will there be memory issues?
    • Dynamic sparsity is addressed in terms of dataloading, but not in the rendering / representation.
  • In the result, the contending model PGDVS seems to work really well, in fact better than the proposed model. Some explanation was mentioned in the README.pdf about how the evaluation is disadvantageous for the proposed model. Yet, the wording is confusing to me.
  • With PGDVS being a strong contender in the comparison, I expect to see more information about it in the related work section. At the least, there should be a baseline paragraph introducing the compared methods.

Questions

  • Why is the DyCheck reconstruction so much blurrier than PGDVS?
  • Minor: the use of \sum in equation 2 should be replaced by a union?

Limitations

  • Handling unseen regions ungracefully. The representation itself is based on strict warping from input views, causing the disoccluded parts to be just black. I don't entirely agree with the authors that "the invisible area is expected to remain unreconstructed". One benefit of ML is to use data priors to fill the holes.
  • Scalability. It's hard to imagine the system could scale up to higher resolutions and more dynamic scenes. The data creation dictates that the motion model only covers rigid motion. The all-to-one warping means a huge amount of redundant Gaussians will be rendered.

Final Justification

Given the context, even though the paper has many issues, I think it still paves the way for the advancement of this very active research field. It should serve as a solid baseline for feed-forward dynamic scene reconstruction.

Formatting Issues

No.

Author Response

We thank the reviewer for their constructive feedback and recognition of our work. We appreciate their noting that our method "tackles a very timely topic and a very challenging task", "generalizes reasonably well to the test data" and achieves "strong performance in 3D tracking".

 

Writing of the method section Thank you for the constructive feedback! We note that reviewer qSmi mentioned that "our paper is mostly well-written and easy to follow" and provided several suggestions to improve the clarity of the paper. We will incorporate the changes described in our rebuttal to their review into a revised version of the paper, including more detailed explanations of key concepts, such as dual-view supervision and reference views, as well as a better visualization of the flow chaining method.

 

Model design Thank you for the insightful feedback! First, we would like to highlight that our method is one of the first attempts to build a feed-forward approach for monocular dynamic scene reconstruction. It is also the first method to jointly predict both 3D Gaussians and deformation fields, supporting novel view synthesis and tracking simultaneously.

  • Scene flow: Our deformation prediction similarly follows common learning-based 2D and 3D flow prediction methods, which also benefit heavily from direct flow supervision; representative works include CoTracker [1], SpatialTracker [2], RAFT [3], and TAPIR [4], as well as classical works like FlowNet [5] and FlowNet 2.0 [6]. In our pilot study, we put significant effort into testing whether the model could learn deformation fields from rendering-loss supervision alone. We concluded that ground-truth supervision is necessary to ensure physically meaningful tracking in the monocular video setting, and we incorporate it by default in this work. We believe future work can relax this constraint and push this direction further.
  • Large number of 3D Gaussian points: While it is true that a per-pixel Gaussian is an over-complete representation, it offers unique advantages such as fast convergence and high-quality texture details. Consequently, numerous recent feed-forward reconstruction methods adopt it as the scene representation, including PixelSplat [7], GS-LRM [8], GRM [9], Long-LRM [10], FLARE [11], etc. DGS-LRM follows this trend. For a 1-second video with 4 key frames, DGS-LRM generates 512 × 512 × 4 ≈ 10^6 Gaussian points (see the back-of-the-envelope sketch after this list). We have empirically found that this does not cause memory issues under our current settings. To scale up to higher resolutions, we may adopt Long-LRM's architecture and strategy, using opacity values to filter out redundant Gaussians, which we leave for future work.
  • Dynamic sparsity: We have empirically found that a per-pixel deformation field fits well with a per-pixel Gaussian representation. We attempted to further explore dynamic sparsity by predicting the motion field from a K-plane representation in our pilot study, but found that it converged very slowly.
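The back-of-the-envelope sketch referenced in the list above: only the 512 × 512 × 4 pixel count comes from the rebuttal; the per-Gaussian float counts, the number of timesteps, and the float32 assumption are my own illustrative choices.

```python
# Rough size estimate for the per-pixel Gaussian representation (assumptions labeled).
H = W = 512
KEYFRAMES = 4
TIMESTEPS = 16                                          # assumed, not from the paper

num_gaussians = H * W * KEYFRAMES                       # 1,048,576 ~= 1e6 (as in the rebuttal)
floats_per_gaussian = 14 + 3 * TIMESTEPS                # assumed static attrs + deformations
bytes_total = num_gaussians * floats_per_gaussian * 4   # float32

print(f"{num_gaussians:,} Gaussians, ~{bytes_total / 2**20:.0f} MiB of parameters")
```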

 

Comparisons with PGDVS We agree PGDVS is a strong baseline, but we pursue a different objective. DGS-LRM is a general feed-forward method that emphasizes learning priors from data. It runs three orders of magnitude faster at inference while being close to the quality delivered by PGDVS. We observe that DGS-LRM can perform slightly better in some cases involving complex motion, such as the pinwheel example in Figure 4. This improvement is due to our method predicting more accurate deformation fields. We also note that PGDVS reconstructs texture details more effectively, as reflected in both the qualitative and quantitative results, which serves as a strong reference for learning-based methods to catch up with. In the revised version of the paper, we will avoid using the phrase "comparable quality" and instead provide a more specific description of the advantages of each method.

We will add the following description of PGDVS to the related work:

PGDVS achieves high-quality novel view synthesis on the DyCheck benchmark while significantly reducing reconstruction time compared to many prior optimization-based methods. It leverages off-the-shelf depth and optical flow estimators for initialization. Uniquely, it renders novel views at different timestamps via image-space warping and aggregates results using dynamic object masks, which produce sharp appearances. However, the method still requires hours to build its representation, and in some challenging cases with complex motion, warping and masking errors can cause temporal flickering artifacts. In our work, we focus on an end-to-end feedforward representation that can deliver similar outputs while being orders of magnitude faster in inference. This provides the potential to run at scale.

 

Handling invisible regions Since the DGS-LRM model is supervised with ground-truth scene flow, it has the ability to recover parts of dynamic objects that are only partially observed in a subset of frames. For regions that have never been seen, a generative prior would be required to hallucinate their appearance, which is beyond the scope of this paper.

 

Scalability Thanks for the insightful feedback! We would like to re-emphasize that DGS-LRM is one of the first attempts at tackling this challenging problem, and it shows competitive generalization ability as well as tracking and view synthesis accuracy on real-world videos under our current setting.

  • Scaling to high resolution: While our current model achieves competitive view synthesis quality without causing memory issues, further scaling up the resolution is possible with better training strategies and novel network architectures. Long-LRM manages to decode 32 images at a resolution of 960 × 540. A similar strategy can be adopted by DGS-LRM, and we leave this for future work.
  • Non-rigid motion: Although our current simulation pipeline only contains rigid motion, the generated per-pixel deformation fields are diverse and comprehensive. Qualitative experiments also demonstrate our model's generalization ability on non-rigid motion, such as in the Teddy and Apple examples. It is also possible to use our Blender pipeline to simulate non-rigid deformable objects, which we leave for future work.

 

Response to questions:

Q1. Blurriness: As shown in the supplementary material, PGDVS has visible flickering deformation artifacts. Our learning-based framework may obtain better deformations but falls short in textural sharpness at this moment. It is worth noting that our inference time is a few orders of magnitude faster due to the nature of feed-forward prediction, which is not reflected in the video comparisons.
Q2. Union instead of sum: Thanks for the detailed suggestion; we will correct it in a revised version.

 

[1] Karaev, Nikita, et al. "Cotracker: It is better to track together." European conference on computer vision. Cham: Springer Nature Switzerland, 2024.

[2] Xiao, Yuxi, et al. "Spatialtracker: Tracking any 2d pixels in 3d space." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[3] Teed, Zachary, and Jia Deng. "Raft: Recurrent all-pairs field transforms for optical flow." European conference on computer vision. Cham: Springer International Publishing, 2020.

[4] Doersch, Carl, et al. "Tapir: Tracking any point with per-frame initialization and temporal refinement." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[5] Dosovitskiy, Alexey, et al. "Flownet: Learning optical flow with convolutional networks." Proceedings of the IEEE international conference on computer vision. 2015.

[6] Ilg, Eddy, et al. "Flownet 2.0: Evolution of optical flow estimation with deep networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

[7] Charatan, David, et al. "pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024.

[8] Zhang, Kai, et al. "Gs-lrm: Large reconstruction model for 3d gaussian splatting." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.

[9] Xu, Yinghao, et al. "Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.

[10] Ziwen, Chen, et al. "Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats." arXiv preprint arXiv:2410.12781 (2024).

[11] Zhang, Shangzhan, et al. "Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.

Comment

Thank you for the response. I think that, put into context, the work provides value to the community. I will raise my score to weak acceptance.

Comment

We sincerely thank the reviewer for their constructive and insightful feedback. We are committed to incorporating the suggested modifications in the final version of our paper.

Review
Rating: 5

The paper addresses the task of 4D scene reconstruction from monocular videos. To solve this task, the authors propose a feed-forward transformer-based approach. Compared to existing feed-forward reconstruction transformers, the main novelty lies in the addition of dynamics.

Next to predicting Gaussian primitive parameters for each input pixel, the authors propose to predict motion offsets for each Gaussian to each frame in the sequence. Another novelty is the use of a temporal tokenizer from MovieGen, which reduces the computational requirements by a factor of 4 due to 4x temporal compression. One downside of this choice, as the authors state, is that the model can only operate on temporally continuous inputs. To handle longer videos, the authors additionally introduce a set of reference frames to help in large-baseline scenarios or under strong temporal scene changes.
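For intuition on the temporal tokenization mentioned above, the small helper below counts transformer tokens with and without a 4x temporal compression; the spatial patch size and resolution are placeholder values of my own, not the paper's settings.

```python
# Token-count sketch for spatio-temporal tokenization (placeholder sizes).
def num_tokens(T, H, W, patch=8, temporal_compression=4):
    # one token per (temporal_compression x patch x patch) volumetric patch
    return (T // temporal_compression) * (H // patch) * (W // patch)

T, H, W = 16, 256, 256
print(num_tokens(T, H, W, temporal_compression=1))   # 16384 tokens without compression
print(num_tokens(T, H, W, temporal_compression=4))   # 4096 tokens with 4x compression
```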

Another major part of the paper is data preparation and training. First of all the authors claim the importance of using a synthetic dataset, in order to provide enough scene and motion diversity. To this end the authors create a new synthetic dataset and state that including real-world data in training did not improve performance on real-world evaluation settings.

Quantitatively, the authors evaluate NVS on 7 videos from DyCheck and 3D point tracking on PointOdyssey. The quantitative results are very promising, and the proposed method outperforms most baselines significantly. Mainly, an optimization-based method (PGDVS) outperforms it on NVS, but it takes orders of magnitude more time.

Overall, the paper makes another important addition to the realm of novel feed-forward reconstruction transformers by incorporating dynamics. Handling dynamics is very relevant for many application scenarios, as the world that we observe is usually not static.

Strengths and Weaknesses

Strengths

  • The novelty of the proposed representation for a feed-forward transformer, which explicitly allows for disentangled motion prediction next to 3D scene reconstruction. Explicitly modeling motion (e.g., as opposed to L4GM) makes this approach more usable for down-stream applications and also allows for additional training supervision, which is ablated and shown to help.
  • The use of temporal tokenization and a set of reference views are both very useful additions, and are also well ablated. Both especially help for videos with more frames, or more (camera) movement. Designing systems that can handle many frames in a video is of special importance.
  • The proposed method shows impressive performance across both NVS and Point tracking tasks on real-world data.
  • If released, the dataset could be very useful for the community. The training strategy is also well ablated.

Weaknesses

  • The novel-view evaluation is only done on 7 videos from the DyCheck dataset, which is not that much. The authors claim that multi-view video datasets with a moving camera are sparse, but what about multi-view video datasets with a fixed camera rig? For example, there are multi-view video datasets for

    • general deformations: see "Immersive Light Field Video with a Layered Mesh Representation", "Dataset and Pipeline for Multi-View Light-Field Video" or "Neural 3D Video Synthesis from Multi-view Video"
    • human bodies: "Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis"
    • human heads: "NeRSemble: Multi-view Radiance Field Reconstruction of Human Heads"
  • There is only one feed-forward baseline in the paper (L4GM), which is trained on an extremely different domain. Additional comparisons to single-image 3D reconstruction, e.g., pixelSplat or follow-up single-view 3DGS predictors, could be used frame-by-frame. Comparisons to scene-scale methods (LVSM, Long-LRM, or any other LRM-style model that, e.g., evaluates on RealEstate10k) could also be used. If these methods break on dynamic scenes, even a comparison on static scenes such as RealEstate10k would yield some interesting insights.

  • While it is great to see that the model generalizes well to real-world data, the need for synthetic data could be seen as a negative, especially when it comes to the limitations of the training data, such as only rigid deformations and small camera movements.

Typos:

  • line 183 has the "." before "convergence" instead of afterwards.
  • line 186 reads "GDS-LRM" instead of "DGS-LRM"

Questions

  • While the presented method is novel, and the dataset generation and training show great engineering effort, the evaluation and the chosen baselines are not a strong side of the paper. As mentioned in the "Weaknesses" section, I believe that additional single-view feed-forward 3DGS baselines that can handle scenes would be a great addition. Similarly, evaluating NVS on another dataset would also be beneficial, since 7 videos is not all that much.
  • Another option to make the evaluation more bullet-proof would be to use the proposed method as initialization for an optimization-based approach.
  • Would it be possible to add 3D point tracking numbers for the baselines to Table 2, which are comparable to the (FC + FV) setting and the (Native) setting?
  • How over-complete are the reconstructed scenes? Since a Gaussian is spawned for every pixel of all keyframes, I assume that objects with little movement have many layers of Gaussians.

Limitations

yes

Final Justification

The paper proposes a novel method for an extremely relevant and upcoming task. A novel and simple dynamic scene representation, temporal tokenization, successful synthetic-to-real transfer, and impressive results on real-world data raise the paper significantly above the acceptance threshold.

Formatting Issues

None

Author Response

We thank the reviewer for their insightful feedback and recognition of our work. We appreciate their noting that "the proposed method shows impressive performance across both NVS and point tracking tasks on real-world data", that our temporal tokenization and reference frames "are both very useful additions, and are also well ablated", and that our explicit motion modeling "makes this approach more usable for down-stream applications".

 

Multi-view video datasets DGS-LRM is designed as a feed-forward deformable Gaussian reconstruction method for monocular videos of dynamic scenes. As noted in the limitations section, our model, trained on continuous video with temporal tokenization, has difficulty generalizing to discrete image frames that simulate unnatural camera movements. As highlighted in DyCheck [1], teleporting among cameras in multi-view dynamic video datasets reaches unnatural speeds far from the real-world distribution. We leave developing a framework that can handle both continuous and discrete camera poses as future work. We also want to highlight that DyCheck is the most widely used dataset for recent dynamic novel view synthesis from a monocular video.

 

Comparisons with static LRM methods Thank you for the constructive suggestion! Most state-of-the-art static feed-forward Gaussian reconstruction methods, such as GS-LRM [2], are not open-sourced. Therefore, we re-implemented GS-LRM to train a static baseline on the RealEstate10K dataset. This baseline achieved a PSNR of 28.77 on the RealEstate10K test set, slightly higher than the PSNR of 28.10 reported in the original paper. The novel view synthesis quality on the DyCheck dataset, summarized in the table below, is significantly worse compared to DGS-LRM. We agree that this experiment underscores the value of explicitly modeling a deformation field. We will include this experiment in a revised version of the paper.

Methods          PSNR    LPIPS
static GS-LRM    13.02   0.444
DGS-LRM          14.89   0.420

 

Synthetic data for training We find synthetic data crucial for the following two main reasons:

  • Dual view supervision enhances novel view synthesis quality, especially with larger baselines.
  • Direct supervision of the deformation field is necessary for accurate 3D tracking. Without it, rendering loss alone is insufficient for precise scene flow reconstruction (a schematic loss sketch follows this list).
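The schematic loss below (referenced in the list above) combines the supervision terms described in the review summary: rendering MSE plus a perceptual (LPIPS-style) term on rendered views, and L1 terms on depth and 3D scene flow against the synthetic ground truth. The loss weights and the `lpips_fn` callable are placeholders of my own, not the paper's values.

```python
# Schematic training loss; weights and lpips_fn are assumptions, not the paper's.
import torch
import torch.nn.functional as F

def training_loss(rendered, gt_rgb, pred_depth, gt_depth, pred_flow, gt_flow,
                  lpips_fn, w_lpips=0.5, w_depth=1.0, w_flow=1.0):
    loss_rgb = F.mse_loss(rendered, gt_rgb)           # rendering MSE
    loss_perc = lpips_fn(rendered, gt_rgb)            # any perceptual metric
    loss_depth = F.l1_loss(pred_depth, gt_depth)      # ground-truth depth (synthetic)
    loss_flow = F.l1_loss(pred_flow, gt_flow)         # ground-truth 3D scene flow
    return loss_rgb + w_lpips * loss_perc + w_depth * loss_depth + w_flow * loss_flow

# toy usage with a dummy perceptual term
dummy_lpips = lambda a, b: torch.tensor(0.0)
x = torch.rand(1, 3, 64, 64)
print(training_loss(x, x, torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64),
                    torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64), dummy_lpips))
```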

Our method is among the first to address this challenging problem. Future work could involve more effective use of real data by distilling knowledge from foundational models, such as monocular depth and flow prediction.

 

Over-complete scene representation Thank you for the insightful feedback! We acknowledge that the per-pixel Gaussian is an over-complete scene representation for the foreground. In our current setting, we have not observed a significant impact on training and rendering speed. However, we anticipate that this issue may become more noticeable with longer context windows. Recent work [3] addresses this problem by utilizing opacity values to filter Gaussian points. We leave exploring this approach as future work.

 

Typos Thank you for pointing these out! We will correct them in a revised version.

Table 2 We apologize for any confusion. Table 2 already includes the tracking numbers for the baselines.

 

[1] Gao, Hang, et al. "Monocular Dynamic View Synthesis: A Reality Check." NeurIPS 2022.

[2] Zhang, Kai, et al. "GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting." ECCV 2024.

[3] Ziwen, Chen, et al. "Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats." arXiv preprint arXiv:2410.12781 (2024).

Comment

I thank the authors for answering several questions and for their additional experiments.

Overall, I stand by my original assessment that the proposed method passes the quality bar for both novelty and experimental evaluation. I don't have any major objections/questions.

There have been two small confusions in the authors' response to my original review, and I apologize for not formulating these things more clearly. That being said, they are not absolutely crucial.

(1): Multi-view datasets. Instead of switching cameras, I thought to use one camera stream as input and a few others to evaluate image metrics. Would that be possible?

(2): Tracking metric in Table 2: I was wondering how the baselines would perform when using the same masking/subset as for "Ours (Native)" and "Ours (FC+FV)".

Comment

We sincerely thank the reviewer for their constructive and insightful feedback. We are committed to incorporating the suggested modifications in the final version of our paper.

We also appreciate the opportunity to address the two questions raised:

Single Camera Stream from Multi-View Datasets as Inputs: The abovementioned multi-view datasets are based on static cameras, which introduce scale ambiguities in monocular depth prediction when only a single camera stream is used as input. While recent studies have made progress in predicting metric depth from monocular images or videos, our preliminary investigations indicate that their depth predictions are not yet sufficiently accurate for view synthesis.

Tracking metrics on PointOdyssey: Thank you for your valuable suggestion! We completely agree that this is a useful metric that further isolates the influence of flow chaining. We are currently conducting experiments to incorporate these new results, but we may need a bit more time to finalize them. If we are unable to include them by the discussion deadline, we are committed to incorporating them in the final version of the paper.

Review
Rating: 4

This paper presents the Deformable Gaussian Splats Large Reconstruction Model, the first feed-forward model to predict dynamic Gaussians for handling dynamic scenes. Specifically, the proposed model takes a video sequence and corresponding camera poses as input and directly outputs deformable dynamic Gaussians, enabling novel view synthesis of dynamic videos. By adopting a feed-forward approach instead of optimization-based methods, the model significantly reduces inference time while maintaining competitive quality. The main contributions are:

1. A deformable 3D Gaussian representation combined with a keyframe-based rendering strategy for efficient Gaussian generation and rendering.

2. A large-scale synthetic dataset for training and evaluation.

3. A novel Transformer-based architecture that reduces memory consumption.

Strengths and Weaknesses

Strengths:

The design of the deformable 3D Gaussian representation combined with the keyframe-based rendering strategy is highly novel.

The feed-forward approach addresses the limitations of optimization-based methods, significantly improving speed while maintaining good quality.

A rich dataset has been collected, which—if open-sourced—could greatly benefit future research in this area.

Weaknesses:

The training and inference costs are prohibitive for most researchers, making the model difficult for others to reproduce or deploy.

The method struggles with large object motions within videos, leading to my worries about significant domain gaps; more experiments are needed to clarify this limitation. Please see the questions.

Questions

  1. Why do deformations such as rotation or opacity in Deformable Gaussian Splats have little effect on quality improvement?

  2. Would incorporating Frame Attention inspired by VGGT further enhance performance?

  3. In the keyframe-based rendering strategy, is there a noticeable quality gap between keyframes and non-keyframes?

  4. The demo does not include long videos or scenes with large viewpoint changes. Will the model’s performance degrade under these conditions?

  5. Please provide ablation studies on the number of parameters, GPU memory consumption, and quality.

If the above questions are addressed effectively, or if there is a clear open-source plan, I will consider raising my score accordingly.

Limitations

yes

Final Justification

This work makes three key contributions: (1) a novel deformable 3D Gaussian representation with keyframe-based rendering, (2) an efficient feed-forward approach that outperforms optimization-based methods in speed while maintaining quality, and (3) a valuable dataset that could benefit the community if released. The combination of innovation and practical impact strengthens this work.

Formatting Issues

No

Author Response

We thank the reviewer for their constructive feedback and recognition of our work. We appreciate their noting that our representation, combined with the key-frame rendering strategy, is "highly novel," and that our feed-forward method succeeds in "significantly improving speed while maintaining good quality."

 

Reproducibility We thank the reviewer for their interest in our work and in the impact of the dataset, which we deeply agree with. We are preparing the release of our code, model, and dataset. The release is still pending formal legal review before we can formally commit, but we are working hard towards achieving this goal.

 

Fixed rotation and opacity For rotation, the per-pixel deformable 3D Gaussians are densely distributed in the foreground area. Each Gaussian typically has a small covariance. Therefore, we empirically did not observe any impact from fixed rotation. For opacity, our pilot study indicates that allowing changes in opacity enables the model to "fake" dynamic appearances without accurately predicting the deformation field, which negatively impacts tracking accuracy.

 

Frame attention Our preliminary experiments do not show significant improvements with Frame Attention under DGS-LRM's setting. We assume this is because our method's inputs include camera poses. However, we also observe that Frame Attention is very effective in the pose-free setting.

 

Keyframes and non-keyframes Thank you for the insightful suggestion! We conducted a quantitative comparison of the view synthesis quality for keyframes and non-keyframes on the DyCheck dataset. We observe a moderate degradation on non-keyframes compared to keyframes, which demonstrates the generalization ability of our method. We will include these results in a revised version of the paper for more comprehensive experiments.

Table 1: Quantitative comparison of view synthesis quality for keyframes and non-keyframes on the DyCheck dataset

Methods          PSNR    LPIPS
Keyframes        15.04   0.419
Non-keyframes    14.84   0.421
DGS-LRM          14.89   0.420

 

Large viewpoint changes and long videos For viewpoint changes, as mentioned in the limitations, our view synthesis quality deteriorates as the viewpoints deviate from the input trajectory. This is also true for other view synthesis methods. To alleviate this issue, we leverage dual-view supervision from synthetic data and temporally distant reference frames as extra inputs. Both the qualitative results in Figure 4 and the quantitative results in Table 1 demonstrate that DGS-LRM achieves better quality compared to the SOTA feed-forward method and some optimization-based methods on the DyCheck dataset, even with large viewpoint changes.

For long videos, while we design a flow chaining method to connect the scene flow from two sets of frames, it still causes a noticeable jump in appearance. Further work with a larger context window is required to address this challenge, such as an architecture similar to Long-LRM [1]. We will provide more detailed explanations in the limitations section of our revised paper.

 

Ablation studies We empirically find that further increasing the model size does not significantly reduce the training loss. Due to constraints on training time and cost, it is challenging for us to train multiple versions with varying model sizes. Our current model has 323 million parameters. Training at a batch size of 4 per GPU takes 80GB of VRAM. We fully agree that verifying whether DGS-LRM follows the scaling law is an important research problem, and we leave this for future work.

[1] Ziwen, Chen, et al. "Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats." arXiv preprint arXiv:2410.12781 (2024).

Comment

Thank you to the authors for their reply, which addressed all my concerns. The paper is enlightening to me. I have decided to keep my positive rating.

Comment

We sincerely thank the reviewer for their constructive and insightful feedback. We are committed to incorporating the suggested modifications in the final version of our paper.

Final Decision

The paper proposes DGS-LRM, a novel feed-forward approach for real-time dynamic scene reconstruction from a monocular posed video of any dynamic scene. It received two accept and two borderline accept ratings. Most of the concerns were resolved during the rebuttal. The AC recommends acceptance of this paper.