PaperHub
Overall score: 7.3/10
Poster · 4 reviewers
Ratings: 5, 5, 4, 4 (min 4, max 5, std 0.5)
Confidence: 3.5
Novelty: 2.8 · Quality: 3.0 · Clarity: 2.5 · Significance: 2.8
NeurIPS 2025

Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation

Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

We propose the first framework that computes a 4D spatio-temporal grid of video frames and 3D Gaussian point clouds for each time step in a feed-forward architecture.

Abstract

Keywords
4D generation; video generation; 4D video generation; multi-view video generation; novel view synthesis

Reviews and Discussion

Review
Rating: 5

This paper proposes a two-stage architecture for 4D scene generation based on a combination of a 4D video diffusion model and a feedforward reconstruction network. It introduces a fusion of cross-view and cross-time attention that requires no additional parameters and generalizes well. A reconstruction network is then extended from a prior 3D feedforward model to predict camera parameters and dynamic Gaussian splats directly from synchronized multi-view video frames, enhancing both speed and geometry quality.

Strengths and Weaknesses

Strengths:

  • The fused view-time attention mechanism is a novel and parameter-efficient solution that effectively leverages large pretrained video models without requiring significant retraining or complex new modules.
  • The paper provides strong quantitative and qualitative results, which support its claims of achieving state-of-the-art results across multiple benchmarks for 4D generation and reconstruction.
  • The experiments incorporate multiple datasets (synthetic, generated, and real-world such as Objaverse, NVIDIA Dynamic, and Neural3DVideo), supporting a robust and wide evaluation. Visualization results demonstrate the enhanced temporal and multi-view consistency of the approach compared to baselines.
  • The feedforward reconstruction module extends VGGT to dynamic 4D settings with targeted modifications (temporal attention, camera token replacement, and a new Gaussian head).

Weaknesses:

  • The ablation studies (e.g., Table 6, Figure 3) are centered on model choices (e.g., attention variants, camera token replacement), but could be strengthened by more extensive analysis, such as examining failure cases in multi-view consistency at scale, sensitivity to training data diversity, or stress tests under extreme scene motions.
  • Compared to 4Real-Video, the effectiveness of the proposed fused view and time attention seems minor (w/o 2D transformed videos).

Questions

See the weaknesses.

Limitations

Yes.

Final Justification

My concerns have been addressed. After checking the other reviewers' comments, I think the work makes a sufficient contribution to the community. I have raised my score.

Formatting Issues

No.

Author Response

We sincerely thank Reviewer mAdS for the thoughtful and constructive feedback. We will incorporate the reviewer’s suggestions to further strengthen the paper in the final version.

More ablation studies

Training data sensitivity analysis

We analyzed the training data sensitivity for both the multi-view video generation and feedforward reconstruction modules below:

  • Multi-view video generation analysis: The table below shows our ablation study on different training set combinations, tested on the object-centric Objaverse dataset and the dynamic scene Nvidia dataset. Each model was trained with identical settings (batch size of 128, 2k iterations). Our observations are as follows:
    • Data efficiency: Our method is data-efficient. For instance, a model trained solely on Objaverse generalizes reasonably well to the out-of-distribution Nvidia dataset.
    • Importance of 2D transformed video datasets: This is a crucial source for dynamic scene generation, yielding the second-best results on Nvidia when used as the sole training data.
    • Combined datasets: Training with all datasets combined delivers strong results across both scenarios.
| Train: Objaverse | Train: Kubric | Train: 2D Trans. Vid | Objaverse PSNR ↑ | Objaverse SSIM ↑ | Objaverse LPIPS ↓ | NVidia PSNR ↑ | NVidia SSIM ↑ | NVidia LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  | 23.30 | 0.923 | 0.071 | 22.77 | 0.729 | 0.130 |
|  |  |  | 17.99 | 0.858 | 0.172 | 21.95 | 0.706 | 0.165 |
|  |  |  | 19.52 | 0.881 | 0.125 | 23.02 | 0.736 | 0.130 |
|  |  |  | 22.38 | 0.914 | 0.086 | 22.74 | 0.727 | 0.136 |
|  |  |  | 22.09 | 0.911 | 0.084 | 23.38 | 0.742 | 0.124 |
  • Feedforward reconstruction analysis:

In the table below, our ablation study evaluates how training data affects the feedforward reconstruction model. The original model is trained on five datasets: RealEstate10K, DL3DV, MVImageNet, ACID, and static cuts of Kubric. The accompanying table shows performance when the model is trained on each dataset individually. For Neural3DVideo, we fine-tune a dynamic variant derived from each baseline. Training on all datasets results in a noticeable improvement. With more compute, one could search over all subsets of these datasets to find the best-performing combination.

| Train Dataset \ Eval Dataset | Neural3DVideo (Dynamic) | Tanks & Temples (Static) | LLFF (Static) |
| --- | --- | --- | --- |
| RealEstate10K | 20.11 | 20.37 | 18.03 |
| DL3DV | 20.17 | 20.48 | 18.41 |
| MVImageNet | 19.97 | 20.03 | 18.33 |
| ACID | 18.31 | 18.28 | 17.23 |
| Kubric | 17.14 | 17.73 | 17.16 |
| All | 21.63 | 20.85 | 18.95 |
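For reference, the sketch below shows one common way to compute PSNR / SSIM / LPIPS numbers such as those reported in these tables. It is an assumed evaluation setup (using scikit-image and the `lpips` package), not the authors' released scripts.

```python
# Minimal metric-evaluation sketch (assumed tooling, not the authors' code).
import numpy as np
import torch
import lpips                                   # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")             # perceptual similarity metric

def eval_image(pred: np.ndarray, gt: np.ndarray) -> dict:
    # pred, gt: float arrays in [0, 1] with shape (H, W, 3).
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects (N, 3, H, W) tensors scaled to [-1, 1].
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}

print(eval_image(np.random.rand(64, 64, 3), np.random.rand(64, 64, 3)))
```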

Limited ablations on stress cases and failure modes

We already discuss the limitations in the conclusion section (L326). We agree that it is a good idea to accompany the limitation discussion with qualitative results. Due to the text-based nature of this rebuttal, we are unfortunately unable to provide additional visual examples at this time. However, we commit to including a thorough discussion on failure and stress cases in the final version.

Marginal gain over 4Real-Video w/o 2D transformed videos?

We kindly encourage a re-examination of Table 3. Both our method and 4Real-Video utilize 2D transformed videos during training. Table 3 presents an ablation of our model without 2D transformed videos, yet it still surpasses 4Real-Video in visual quality. Our complete method, trained with 2D transformed videos, demonstrates a substantial improvement in cross-view consistency, while maintaining visual quality comparable to our model without 2D transformed videos.

Comment

Thanks for the response. The rebuttal addressed my concerns. After checking the other reviewers' comments, I think this work offers sufficient insight. I will raise my score.

Review
Rating: 5

This paper proposes a two-stage framework for 4D dynamic scene generation. First, it introduces a fused view-time attention mechanism in a diffusion transformer to generate synchronized multi-view video grids more efficiently than previous sequential or parallel attention designs. Second, it extends the VGGT feedforward reconstruction model to transform these videos into dynamic 3D Gaussian splats, jointly predicting geometry and camera parameters without slow optimization. Experiments show the method surpasses existing baselines in visual quality, multi-view consistency, and reconstruction accuracy, establishing a new state-of-the-art for general 4D scene generation.

Strengths and Weaknesses

Strengths

  • The paper presents a two-stage framework that first generates a 4D spatio-temporal grid of video frames, and then reconstructs the time-varying 3D Gaussian splats in a feedforward manner, setting a new state of the art for 4D scene generation.

  • The fused view-time attention combines cross-view and cross-time dependencies into a single masked self-attention layer without introducing extra parameters, making it an efficient architecture that can be fine-tuned directly from pretrained video diffusion models.

  • The second part of the model adapts the pre-trained VGGT model for dynamic 4D reconstruction by adding temporal attention layers, a Gaussian head, and a camera token replacement strategy, enabling fast multi-timestep geometry recovery without iterative optimization. The VGGT-based reconstruction pipeline does not require camera poses as input, which may help alleviate errors caused by mismatches between the generated views and the camera poses.

  • The method achieves strong results across multiple datasets, outperforming baselines on various metrics on novel view synthesis, video generation, and multi-view consistency. It also provides thorough qualitative examples in the supplementary material, showing high visual quality and multi-view consistency.

Weaknesses

  • The paper mentions that sequential architectures fail to generate proper white backgrounds, but this pattern does not seem that hard to learn, and the results on the appendix page look surprisingly poor. Could the authors provide more explanation?

  • While the fused view-time attention is claimed to reduce memory and computation through sparse masking and FlexAttention, there is no quantitative evidence or concrete figures to support the efficiency claims.

  • The paper does not provide specific details about the training time and resource requirements of the feedforward reconstruction stage. The VGGT training is costly, and the training code was not released at the time of NeurIPS submission. It would be helpful if the authors could provide more details on this. Additionally, Kubric is the name of the Blender-based dataset generation engine, instead of the name of a dataset. What exact datasets from Kubric did you use to train and evaluate the 4D reconstruction model?

  • The paper layout feels too crowded and might have used too many vspace adjustments. It would be more reader-friendly if the authors organized the content more concisely and left fine details in the supplementary material, accompanied by proper references.

Questions

Please see the weaknesses section.

Limitations

Yes

Final Justification

The rebuttal provided important details and additional quantitative stats for the sparse attention mechanism. All my concerns from the initial review have been resolved, and I have increased the score to 5. Please incorporate the details into the paper and organize the manuscript for better readability.

Formatting Issues

NA

Author Response

We thank Reviewer f2R9 for the detailed and encouraging feedback. We will incorporate the reviewer’s suggestions in the final version of the paper.

Why did sequential architectures perform poorly?

We have traced the issue to instability when the sequential model is trained on our mixed dataset. Its loss curve fluctuates more and converges to a higher value, reflecting less stable learning. Although the model performs reasonably on real scenes (Table 2 in the paper), it often misidentifies the white background that appears in about one-third of the training sequences, producing visible artifacts in the results. These background artifacts are simply the most obvious of several failure modes; closer inspection also reveals subtler issues such as blurred edges and temporal flickering. The sequential design exhibits an effective capacity that is lower than that of the fused model, making it harder to reconcile the differing background statistics. We will include these findings, along with loss curves and additional qualitative examples, in the revision.

Quantitative evaluation of sparse attention mechanism efficiency

We measured the runtime and peak memory consumption of various architectures. Our profiling was conducted on an A100 GPU using float32 precision, evaluating a single forward pass with 8 views and 61 frames at 288x512 resolution (resulting in 8x16x36x64 tokens after tokenization and patchification). Our findings indicate the following:

  • A sequential architecture, which adds cross-view attention layers (similar to SynCamMaster), would increase network parameters by 50%. This leads to out-of-memory errors, given the already substantial 11B base video model.
  • Another sequential model variant, which utilizes existing DIT blocks from the base video model but interleaves cross-view and cross-time attention dimensions, achieves faster runtime but sacrifices quality (see Table 2).
  • A parallel architecture (similar to 4Real-Video) results in higher peak memory consumption and increased runtime.
  • The proposed fused View-Time attention, when applied without block masking, leads to significantly slower runtime compared to its application with block masking.
| | Sequential (+ Cross-View Attn.) | Sequential (Interleave Existing Layers) | Parallel | Fused Attn. (w/o Block Mask) | Fused Attn. (With Block Mask) |
| --- | --- | --- | --- | --- | --- |
| Total Parameters | 17B | 11B | 11B | 11B | 11B |
| Peak Memory (GB) | OOM | 54.9 | 55.9 | 54.9 | 54.9 |
| Runtime (s) | OOM | 9.3 | 13.9 | 20.5 | 11.7 |
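For completeness, a minimal profiling harness of the kind used to obtain numbers like those above might look as follows. The model handle and token shape are placeholders; this is an assumed measurement setup, not the authors' actual profiling code.

```python
# Illustrative GPU profiling sketch (assumptions only): peak memory and runtime
# for a single forward pass in float32, as in the table above.
import time
import torch

def profile_forward(model: torch.nn.Module, tokens: torch.Tensor) -> tuple[float, float]:
    model = model.cuda().float().eval()
    tokens = tokens.cuda().float()
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model(tokens)                                     # single forward pass
    torch.cuda.synchronize()
    runtime_s = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return runtime_s, peak_gb

# Example (hypothetical model and token layout): 8 views x 16 latent frames x
# 36 x 64 tokens after patchification, flattened into one sequence.
# runtime, peak = profile_forward(my_dit_model, torch.randn(1, 8 * 16 * 36 * 64, 1024))
```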

Training time and resources

  • Training details for feedforward reconstruction: We used the publicly available weights and implemented the training code ourselves based on the released model of VGGT. The training is done on 32 A100 GPUs for around a day. We will provide a thorough explanation of the training and costs in the final version.

  • Unclear details about the Kubric-based dataset: We followed the literature in rendering 32 views for each of the 3k scenes at different timesteps [1]. The views are sampled at the same altitude with constant azimuth angle increments (a toy sketch of this camera sampling follows this list). We will clarify and provide details in the final draft.

  • Training details for multi-view video generation: The training cost for the video diffusion model was reported in the supplementary (see Ln. 754): it was trained on 48 A100 GPUs for around 2 days, totalling around 96 GPU-days. We also experimented with a reduced training budget, using 32 A100 GPUs for about 20 hours (2k iterations). This yielded qualitatively competitive results, suggesting the method's reproducibility with fewer resources than initially reported. As reported above, our feedforward reconstruction model was trained on 32 A100 GPUs for around a day.
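As referenced above, here is a toy sketch of the described camera sampling: 32 views at a fixed altitude with constant azimuth increments around the scene. The radius and altitude values are illustrative assumptions, not the dataset's actual parameters.

```python
# Toy camera-sampling sketch (assumption, not the authors' rendering script).
import numpy as np

def sample_cameras(n_views: int = 32, radius: float = 4.0, altitude_deg: float = 30.0):
    elev = np.deg2rad(altitude_deg)
    azims = np.deg2rad(np.arange(n_views) * (360.0 / n_views))   # constant azimuth steps
    # Camera centers on a circle at fixed elevation, looking toward the origin.
    centers = np.stack([
        radius * np.cos(elev) * np.cos(azims),
        radius * np.cos(elev) * np.sin(azims),
        np.full(n_views, radius * np.sin(elev)),
    ], axis=-1)
    return centers                                                # shape (n_views, 3)

print(sample_cameras().shape)  # (32, 3)
```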

Cluttered layout

We will fix the formatting in the final version.

[1] Liang et al., “Feed‑Forward Bullet‑Time Reconstruction of Dynamic Scenes from Monocular Videos,” 2025

Comment

Dear Reviewer f2R9,

Could you please read the authors' rebuttal and respond whether your concerns have been addressed or not? Thank you!

Best regards,

Your AC

Comment

Thank you for providing the details and the quantitative stats for the sparse attention mechanism. All my concerns are resolved. Please incorporate the details into the paper and organize it for better readability. I will increase the score to 5.

Review
Rating: 4

This paper proposes a two-stage framework for 4D scene generation that combines:

  1. A fused view-time attention diffusion model that directly generates synchronized multi-view video frames, and
  2. A feedforward reconstruction module that lifts those frames into 3D Gaussian splats with associated camera poses.

The proposed diffusion model introduces a novel fused attention mechanism that unifies spatial (cross-view) and temporal (cross-time) attention into a single sparse self-attention layer. This improves parameter efficiency, consistency, and generalization while avoiding additional synchronization modules used in prior works.

The second stage builds upon VGGT and extends it with 3DGS prediction, enabling end-to-end prediction of camera parameters and dynamic 3D Gaussians in a fast, optimization-free manner.

The approach significantly outperforms prior work across 4D video generation and feedforward 3D reconstruction benchmarks.

Strengths and Weaknesses

Strengths

  1. Introduces an elegant and parameter-efficient sparse attention mechanism that fuses spatial (cross-view) and temporal (cross-time) attention into a single masked self-attention layer. This unified design allows spatial and temporal generation to be achieved jointly.
  2. The integration of 4D video generation with a feedforward Gaussian reconstruction pipeline into a unified framework is a significant engineering achievement.

Weaknesses

  1. Unclear Dynamic Modeling in Feedforward Reconstruction: While the paper states that the FFD stage predicts Gaussian parameters (e.g., opacity, scale) using an additional GS head, it does not explain whether and how dynamic motion is explicitly modeled. If only static Gaussians are predicted from the generated videos, the model would theoretically be limited to reconstructing the video itself, not enabling novel view synthesis (NVS) of dynamic scenes. This creates a discrepancy with the experiments, where dynamic NVS appears to be supported. The paper should clarify whether dynamic modeling is included in the FFD stage, and how it is implemented.
  2. Evaluation Disconnect and Scalability Concerns: In Appendix Table S1, the authors mention that FFD fails to run at 16 views on a single A100 due to out-of-memory issues. Moreover, the FFD module is evaluated separately, not as part of the full video generation pipeline. Since the generation stage produces up to 30 frames, this raises questions about the consistency and relevance of the FFD evaluation. The motivation of combining those two huge modules together is questionable.
  3. Training Cost Not Reported: Both the video diffusion model and the FFD module are computationally intensive. However, the paper does not provide adequate details regarding training costs and runtime, making it difficult to assess the method’s practicality and reproducibility.

Questions

Please refer to the weakness

Limitations

Yes

Final Justification

The authors addressed all of my concerns during the rebuttal period.

Formatting Issues

No formatting issue

Author Response

We appreciate Reviewer w5xW’s positive evaluation and constructive feedback. Our responses to each comment are detailed below, and the paper draft will be updated accordingly.

Unclear Dynamic Modeling in Feedforward Reconstruction

Our feedforward reconstruction model generates pixel-aligned Gaussian particles for each input frame from the multi-view video produced in the first stage, enabling novel view synthesis over time. Specifically, given an input multi-view video of resolution HxW with V views and T timesteps, the reconstruction model outputs V x T x H x W Gaussian splats. This is similar to other feedforward approaches for dynamic scenes—such as L4GM—and does not explicitly model motion, instead generating independent sets of Gaussian particles for each timestep. While we acknowledge that explicit motion modeling remains an important research direction, in this work we model motion implicitly through temporal attention layers. Notably, our framework is agnostic to the reconstruction backbone, allowing it to incorporate more advanced feedforward models as they emerge.
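To make the pixel-aligned prediction concrete, below is a schematic sketch of a per-pixel Gaussian head of the kind described. The feature dimension, parameterization, and activations are our assumptions for illustration, not the authors' exact architecture.

```python
# Schematic sketch (assumed design, not the authors' exact GS head): map per-pixel
# backbone features to 3D Gaussian parameters for every view and timestep.
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # 3 (xyz) + 1 (opacity) + 3 (scale) + 4 (rotation quat) + 3 (RGB) = 14 channels
        self.proj = nn.Conv2d(feat_dim, 14, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> dict:
        # feats: (V*T, C, H, W) per-frame features -> one Gaussian per pixel.
        out = self.proj(feats)
        xyz, opacity, scale, rot, rgb = torch.split(out, [3, 1, 3, 4, 3], dim=1)
        return {
            "xyz": xyz,                                              # position (or ray offset)
            "opacity": torch.sigmoid(opacity),                       # in (0, 1)
            "scale": torch.exp(scale.clamp(max=8)),                  # positive scales
            "rotation": torch.nn.functional.normalize(rot, dim=1),   # unit quaternion
            "rgb": torch.sigmoid(rgb),
        }

V, T, H, W = 8, 29, 36, 64
gaussians = GaussianHead()(torch.randn(V * T, 128, H, W))
print(gaussians["xyz"].shape)  # torch.Size([232, 3, 36, 64]) -> V*T*H*W Gaussians in total
```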

Evaluation Disconnect and Scalability Concerns

  • FFD evaluated separately: Our method has two stages. We rigorously evaluate each component, achieving state‑of‑the‑art performance, and provide qualitative results for the combined stages in the paper as well as the supplementary material. Because of the 100 MB limit, we included fewer videos (without cherry‑picking) and will add more examples in the camera‑ready version. Below, we present a quantitative evaluation of the full pipeline on the Nvidia Dynamic Scene dataset, holding out views 1, 4, 7, and 10 at every timestep for evaluation. The optimization‑based baseline, 4DGS [1], is run on the first view at all timesteps plus the first timestep of every view (the first row and first column of the 4D grid); in a second setting, we first complete the grid with the 4D video generator (using the first column and the first row as input) and then optimize with 4DGS; in a third setting, we feed the generated grid to our feedforward reconstruction model, thereby applying the complete two‑stage pipeline. Our pipeline outperforms the baselines in quality and reduces reconstruction time from over an hour (4DGS) to under 3 seconds. Even if this or another optimization-based method could produce better visual results, it would not be a practical option due to its prohibitive per-scene runtime. To avoid potential issues with baseline depth/optical-flow priors failing on generated content, we utilized 4DGS as the standard prior-free method for dynamic scene reconstruction.
| | 4DGS | 4D Video Gen. + 4DGS | 4D Video Gen. + Feedforward Recon. |
| --- | --- | --- | --- |
| PSNR | 17.60 | 19.05 | 19.14 |
  • FFD stage faces scalability issues (e.g., OOM at 16 views) without a mitigation strategy

In our experiments, the video generation stage produces 8 to 15 views, depending on the setup, and the feedforward model is perfectly capable of processing these original outputs. Both models can be applied autoregressively to support more views and timesteps. We will add thorough clarifications in the final version.

Training cost not reported

  • The training cost for the video diffusion model was reported in the supplementary (see Ln. 754): it was trained on 48 A100 GPUs for around 2 days, totalling around 96 GPU-days. We also experimented with a reduced training budget, using 32 A100 GPUs for about 20 hours (2k iterations). This yielded qualitatively competitive results, suggesting the method's reproducibility with fewer resources than initially reported.

  • Training cost for the feedforward reconstruction model: We build our model on VGGT and fine-tune it from the released VGGT checkpoint. The training is done on 32 A100 GPUs for around a day. We will provide a thorough explanation of the training and costs in the final version.

  • Compute-heavy design raises practicality concerns: We respectfully disagree with the concern that our training cost is excessive.

    • Our 4D video model was trained using 96 GPU-days, which is comparable to, or even lower than, several recent baselines. For instance, 4Real-Video required 32 A100s for 2 days (64 GPU-days), BTimer used 32 A100s for 4 days (128 GPU-days), GS-LRM needed 64 A100s for 3 days (192 GPU-days), LVSM used 64 A100s for 3 to 7 days (192–448 GPU-days), and LRM relied on 128 A100s for 3 days (384 GPU-days).
    • At inference, our framework generates a full grid of 288×512-resolution videos with 8 views and 29 frames using 40 diffusion steps in approximately four minutes, and performs feedforward reconstruction in just a few seconds, comparable to existing video generation methods. We also experimented with varying numbers of diffusion steps (see Table S2 in the Supplementary Material) and reported the corresponding runtimes. While standard diffusion-model distillation techniques could further reduce inference time, such optimization lies beyond the scope of the current work.

[1] Wu et al., “4D Gaussian Splatting for Real‑Time Dynamic Scene Rendering,” CVPR 2024.

Comment

Thank you for the thorough response. I will maintain my positive score.

Review
Rating: 4

This paper proposes a two-stage framework for 4D scene generation from text prompts. The first stage uses a novel diffusion model with fused view-time attention to generate synchronized multi-view videos, while the second stage employs a feedforward reconstruction network to convert these videos into dynamic 3D Gaussian representations. The key innovation is the fused attention mechanism that processes spatial and temporal dimensions simultaneously using masked self-attention, eliminating the need for additional parameters while achieving better consistency than sequential or parallel architectures.

Strengths and Weaknesses

Strengths

  1. The fused view-time attention mechanism is elegant and well-motivated. By allowing tokens to attend to others sharing the same view or timestamp through a single masked self-attention layer, it achieves parameter efficiency while maintaining strong performance.

  2. Comprehensive Evaluation: The paper provides extensive experimental validation across multiple datasets (Objaverse, NVIDIA Dynamic, generated videos) with appropriate metrics. The quantitative results consistently show improvements over strong baselines like SV4D, TrajectoryCrafter, and ReCamMaster variants.

  3. The paper is generally well-written and easy to follow.

Weaknesses

  1. While claiming efficiency gains, there's no rigorous comparison of FLOPs, memory usage, or convergence properties against baseline architectures.

  2. The binary mask definition M(·,·) ∈ {0,1} is hard to follow. The paper doesn't address how spatial tokens within the same (v,t) coordinate interact, whether the mask supports learnable attention weights, or how gradient flow is affected by the irregular sparsity pattern.

  3. The coordinate mapping (v,t,x,y) → (v×Tmax + t, x, y) appears ad-hoc. No comparison with learned 4D embeddings, factorized representations, or other coordinate encoding schemes is provided.

  4. Missing citations to related work, e.g., Nutworld [1].

[1] Shen Q, Yi X, Lin M, et al. Seeing World Dynamics in a Nutshell[J]. arXiv preprint arXiv:2502.03465, 2025.

Questions

See the weaknesses

Limitations

See the weaknesses

Final Justification

The explanation successfully resolves most of my concerns. I will maintain my positive evaluation.

Formatting Issues

No Formatting Concerns

Author Response

We appreciate Reviewer 9auW's positive evaluation and constructive feedback. Our responses to each comment are detailed below, and the paper draft will be updated accordingly.

Efficiency comparison with baseline architectures

We measured the runtime and peak memory consumption of various architectures. Our profiling was conducted on an A100 GPU using float32 precision, evaluating a single forward pass with 8 views and 61 frames at 288x512 resolution (resulting in 8x16x36x64 tokens after tokenization and patchification). Our findings indicate the following:

  • A sequential architecture, which adds cross-view attention layers (similar to SynCamMaster), would increase network parameters by 50%. This leads to out-of-memory errors, given the already substantial 11B base video model.
  • Another sequential model variant, which utilizes existing DIT blocks from the base video model but interleaves cross-view and cross-time attention dimensions, achieves faster runtime but sacrifices quality (see Table 2).
  • A parallel architecture (similar to 4Real-Video) results in higher peak memory consumption and increased runtime.
  • The proposed fused View-Time attention, when applied without block masking, leads to significantly slower runtime compared to its application with block masking.
| | Sequential (+ Cross-View Attn.) | Sequential (Interleave Existing Layers) | Parallel | Fused Attn. (w/o Block Mask) | Fused Attn. (With Block Mask) |
| --- | --- | --- | --- | --- | --- |
| Total Parameters | 17B | 11B | 11B | 11B | 11B |
| Peak Memory (GB) | OOM | 54.9 | 55.9 | 54.9 | 54.9 |
| Runtime (s) | OOM | 9.3 | 13.9 | 20.5 | 11.7 |

Clarification of masked self-attention

We aimed for mathematical precision in defining the binary mask in Eq. 1, but acknowledge that this resulted in a somewhat complex formulation. We will revise the presentation to improve clarity. Conceptually, however, the idea is straightforward: each token is indexed by (v,t,x,y), and tokens are allowed to attend to one another if they share the same view v, the same timestamp t, or both. Attention between token pairs that do not satisfy any of these conditions is masked out. Regarding the reviewer's specific questions (a code sketch of this mask follows the list below):

  • How do spatial tokens within the same (v,t) coordinate interact? They attend to each other, in the same way as the unmodified 3D self-attention in the original base video model; this is the case of tokens sharing both the same view and the same timestamp.
  • Does the mask support learnable attention weights, and how is gradient flow affected? In an attention layer, the learnable weights are the projection parameters used to compute the query (Q), key (K), and value (V) matrices. In the case of sparse attention, these same weights are updated during training. However, token pairs whose attention is masked out do not contribute to the gradient; only token pairs for which attention is computed affect the gradient update. This follows the standard behavior of masked attention operations and is fully supported by the FlexAttention implementation in PyTorch.
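As referenced above, here is a minimal sketch of how such a "same view or same timestep" block mask can be expressed with PyTorch FlexAttention. The token layout and sizes are toy assumptions, not the authors' implementation; it requires torch >= 2.5 and a GPU is strongly recommended.

```python
# Illustrative fused view-time attention mask with FlexAttention (not the authors' code).
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

V, T, S = 4, 8, 16            # views, timesteps, spatial tokens per frame (toy sizes)
H, D = 2, 64                  # attention heads, head dimension
N = V * T * S                 # flattened sequence length
device = "cuda" if torch.cuda.is_available() else "cpu"  # CPU needs a very recent PyTorch

def view_time_mask(b, h, q_idx, kv_idx):
    # Recover (view, time) from the assumed flattened index i = (v * T + t) * S + s.
    q_v, q_t = q_idx // (T * S), (q_idx // S) % T
    k_v, k_t = kv_idx // (T * S), (kv_idx // S) % T
    # A pair attends iff the two tokens share the same view OR the same timestep.
    return (q_v == k_v) | (q_t == k_t)

block_mask = create_block_mask(view_time_mask, B=None, H=None, Q_LEN=N, KV_LEN=N, device=device)

q, k, v = (torch.randn(1, H, N, D, device=device) for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)   # fused, sparse view-time attention
print(out.shape)                                       # torch.Size([1, 2, 512, 64])
```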

Comparing to other 4D coordinate embedding

We implemented a new baseline that incorporates a learnable positional embedding for the view dimension. Specifically, for each token indexed by (v, t, x, y), we compute separate positional embeddings for the view v and the spatiotemporal coordinates (t, x, y). For the view dimension, we apply sinusoidal encoding with 16 learnable frequencies, followed by a two-layer MLP to produce the final view embedding. For (t, x, y), we adopt the 3D RoPE (Rotary Positional Embedding) used in the original base video model. To integrate both positional embeddings into the attention mechanism, we first add the view embedding to the key and query vectors, and then apply the 3D RoPE. This ensures that the RoPE is applied correctly while still incorporating view-specific information.

We compare this baseline against our proposed embedding under identical training and testing settings (see table below). We observe that the model with learnable view embeddings converges more slowly during training. While it achieves comparable performance on dynamic-scene datasets such as the NVIDIA dataset, it fails to produce correct backgrounds (e.g., white backgrounds in the Objaverse dataset) resulting in catastrophic failures. We hypothesize that this may be due to confusion arising from the mixed nature of the training data. Further investigation, such as tuning the training schedule or dataset composition, may help address this issue. In conclusion, the proposed embedding is empirically more stable to train, achieves competitive or superior quality, and introduces no additional parameters.

| Positional Embedding | Objaverse PSNR ↑ | Objaverse SSIM ↑ | Objaverse LPIPS ↓ | NVidia PSNR ↑ | NVidia SSIM ↑ | NVidia LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Learnable View PE | 19.50 | 0.899 | 0.161 | 23.45 | 0.751 | 0.113 |
| (v,t,x,y) → (v×Tmax + t, x, y) | 22.49 | 0.909 | 0.113 | 23.15 | 0.752 | 0.142 |
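For concreteness, below is a minimal sketch of the learnable view positional-embedding baseline described above: a sinusoidal encoding of the view index with 16 learnable frequencies followed by a two-layer MLP, added to the query/key vectors before the base model's 3D RoPE. Dimensions and module structure are our assumptions, not the authors' exact code.

```python
# Sketch of the learnable view-PE baseline (assumptions only).
import torch
import torch.nn as nn

class LearnableViewPE(nn.Module):
    def __init__(self, head_dim: int, n_freqs: int = 16):
        super().__init__()
        self.freqs = nn.Parameter(torch.randn(n_freqs))            # learnable frequencies
        self.mlp = nn.Sequential(                                   # two-layer MLP to head_dim
            nn.Linear(2 * n_freqs, head_dim), nn.GELU(), nn.Linear(head_dim, head_dim)
        )

    def forward(self, view_idx: torch.Tensor) -> torch.Tensor:
        # view_idx: (...,) integer view indices; returns (..., head_dim) embeddings.
        ang = view_idx.float().unsqueeze(-1) * self.freqs           # (..., n_freqs)
        enc = torch.cat([ang.sin(), ang.cos()], dim=-1)             # sinusoidal features
        return self.mlp(enc)

# Usage: add the view embedding to Q and K, then apply the base model's 3D RoPE over
# (t, x, y). The proposed alternative instead remaps (v, t, x, y) -> (v * T_max + t, x, y)
# and reuses the existing RoPE unchanged, adding no parameters.
pe = LearnableViewPE(head_dim=64)
print(pe(torch.arange(8)).shape)   # torch.Size([8, 64]) for 8 views
```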

Citation to concurrent work (Nutworld)

We thank the reviewer for recommending the Nutworld paper. We will include it as related work for feedforward reconstruction.

Comment

I appreciate the authors’ thorough response. The explanation successfully resolves most of my concerns. I will maintain my positive evaluation.

Final Decision

This paper received unanimously positive reviews after the rebuttal (2 accepts and 2 borderline accepts).

All reviewers recognize the novel fused view-time attention mechanism, which is the core contribution of this paper. This parameter-efficient design, which unifies spatial (cross-view) and temporal (cross-time) attention into a single sparse layer, was seen as a significant architectural innovation. The extension of VGGT for dynamic 4D reconstruction by adding a Gaussian head, among other components, is likewise recognized by most reviewers, as it enables fast multi-timestep geometry recovery without iterative optimization. Reviewers also value the extensive empirical results supporting the claims.

There were some concerns after the initial reviewing stage. Notably, these included: (1) a lack of quantitative evidence to support the claimed computational and memory efficiency of the novel attention mechanism; (2) insufficient detail regarding the training costs and resource requirements for the full pipeline; and (3) insufficient detail regarding the extension of VGGT. The authors provided additional empirical results and details in response to these comments, and after the rebuttal all of the concerns have been addressed.

The AC therefore recommends accepting this paper. The authors are highly recommended to incorporate the new results and technical details reported in the rebuttal into the revision to improve the quality of the paper.