PaperHub
Overall: 6.0/10 · Poster · 4 reviewers
Ratings: 2, 4, 4, 5 (min 2, max 5, std 1.1)
Confidence: 4.0
Novelty: 2.8 · Quality: 3.0 · Clarity: 2.5 · Significance: 3.3
NeurIPS 2025

DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous Driving

Submitted: 2025-05-08 · Updated: 2025-10-29

Abstract

Keywords
Autonomous Driving; 4D Gaussian Reconstruction

Reviews and Discussion

Review (Rating: 2)

This paper proposes DrivingRecon, a feed-forward 4D Gaussian scene reconstruction model for autonomous driving. It predicts dynamic and static Gaussians from multi-view video using a shared 2D encoder and a range-view decoder. A core contribution is the Prune and Dilate Block (PD-Block), which learns to handle spatial redundancy by pruning overlapping Gaussians and reallocating them to complex regions. The work demonstrates support for downstream tasks such as planning, scene editing, and novel view synthesis, and achieves strong performance on the nuScenes and Waymo datasets. The system is trained at scale and claims real-time rendering speeds; however, it involves hand-crafted thresholds and omits comparison to recent real-time online GS reconstruction methods that do not require large model training, raising questions about scalability and fairness in the reported claims.

Strengths and Weaknesses

Strengths:

  1. The PD-Block is a technically sound idea to mitigate spatial redundancy and reallocate Gaussians.
  2. The paper reports strong performance on benchmark datasets and supports multiple downstream tasks (reconstruction, editing).
  3. The framework is feed-forward and generalizable, avoiding per-scene optimization and enabling real-time rendering.

Weaknesses:

  1. For the inference speed reported in Table 1, is it the average time to run all frames of different scenes (the whole scene)? Or is it the total optimization time for the per-scene optimization methods but the per-frame inference time for the generalizable feed-forward methods? In other words, is DriveRecon's 0.08s the per-frame inference time including the regression of all attributes (including 2D/3D flow and semantics) of all Gaussians?
  2. I also doubt whether it is fair and appropriate to compare the "Inference Speed" this way in Table 1. For the "Per-Scene Optimization" methods, the reported time is not just inference; it also includes training. The proposed DriveRecon is trained for generalization with extremely large computation (24 A100 GPUs × 24 days), so I think it is unfair and inappropriate to compare the "Inference Speed" in a way that includes the optimization time of the "per-scene optimization" methods.
  3. Following the comments above, I think the generalizable model is not necessary if per-scene optimization is fast enough, and the generalizable model is too difficult to train. There have been real-time online GS reconstruction methods since last year, such as RTGSLAM (https://github.com/MisEty/RTG-SLAM/tree/main), which achieves real-time GS reconstruction without the need for pretraining. I think a better comparison for Table 1 would include these real-time GS reconstruction methods, which could potentially achieve much faster inference speed than the "generalizable" ones. Plus, some methods such as RTGSLAM also propose GS management algorithms to efficiently grow and prune GS. If they work comparably well, then there is no point in spending huge hardware and time resources to train the model. This is why I think the comparison to these works is very important.
  4. Writing and formatting issues: Please check the paper contents carefully. There are mismatches in the explanation of terms around line 98–99, and the order of term definitions is also messy. Additionally, typos such as "fellowing" (lines 181, 189) instead of "following" suggest the writing needs more careful proofreading before submission.

Questions

I wrote my questions in the weaknesses section above.

Limitations

Yes

Formatting Issues

No major formatting issues.

Author Response

Thank you, reviewer, for your valuable time and comments. Your biggest concerns are the inference speed and the training time. We believe the following reply will alleviate your concerns:

W1: Inference speed reported in Table 1. Our inference time (0.08s) covers the entire scenario (frames at t + 0s, t + 0.5s, t + 1.0s, and t + 1.5s) and includes all Gaussian properties (including 2D/3D flow and semantics). This setting is exactly the same as that of STORM, so the comparison is fair.

W2: Fairness of the inference-speed comparison. Our algorithm only requires 24 GPU-days in total (24 A100 GPUs × 1 day), not 24 A100 GPUs × 24 days. Furthermore, our experiments found that performance is already close to the best at 2 GPU-days (24 A100 GPUs × 2 hours).

W3: Comparison with RTGSLAM. RTGSLAM requires a dense depth map as input, and predicting dense depth values from RGB images still takes time. In comparison, our algorithm only requires RGB images as input to reconstruct 4D Gaussians. Our method has already surpassed existing feed-forward methods in the driving domain in terms of both speed and quality metrics. Unlike Gaussian management via post-processing (grow and prune), our algorithm is learnable and end-to-end.

W4: Writing and formatting issues. We will correct these typos and formatting issues in the final version.

Comment

Dear Reviewer,

I would like to express my sincere gratitude for the time and effort you have dedicated to reviewing my work.

I understand that your primary concerns relate to the inference speed and training time. I believe that my previous response has effectively addressed these issues. Could I kindly inquire if you are contemplating an improvement to the score based on the information provided?

Thank you once again for your consideration.

Comment

Dear Reviewer WHfH,

Thank you for your valuable time and comments. Your biggest concerns are the inference speed and the training time. We believe our reply has fully addressed your concerns. Would you consider raising your score? We very much look forward to your further feedback.

Thank you once again for your consideration.

Review (Rating: 4)

This paper proposes a Large 4D Gaussian Reconstruction Model for spatial-temporal reconstruction in autonomous driving. There are two main contributions: 1) a prune and dilate module to reduce the number of Gaussians, and 2) decoupling of static and dynamic objects to improve reconstruction quality. Experiments show that the proposed method outperforms state-of-the-art methods on the nuScenes dataset.

Strengths and Weaknesses

++ The contributions are clearly stated. Experiments show that the proposed method outperforms state-of-the-art methods on the nuScenes dataset. The related work section includes most of the recent relevant works in the field of Large Reconstruction Models and Driving Scene Reconstruction.

-- The main issue with the paper is that it lacks some details in the method section. There are no details about the image encoder (size, architecture, etc.). Moreover, the image decoder is not described in detail, which makes it difficult to understand how the image features are transformed into Gaussian points. The main contribution of the paper, the Prune and Dilate module, is also not described in detail: what are the hyperparameters, how are they chosen, etc. In the Static and Dynamic Object Decoupling section, we suppose that the dynamic objects are selected based on their class labels, but this is not clearly stated in the paper.

Questions

The rendered resolution of your model remains low: what are the main concerns regarding a higher resolution with 3DGS?

The loss function is composed of several heterogeneous losses: how do you optimize the values of the associated hyperparameters?

Limitations

yes

Final Justification

Even if the paper still has some issues, I think that the authors have addressed most of my concerns. I think that the proposed model may be accepted if possible.

Formatting Issues

References are not consistent:

  • Many reference titles are incomplete: e.g. "pixelsplat" for "pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction", "Mvsplat", "Omnire", ... You should fix this!
Author Response

Thank you, reviewer, for your valuable time and comments. Your biggest concerns are the model details, the PD-Block details, and the segmentation details. We believe the following reply will alleviate your concerns:

W1: The details about the model. Our encoder and decoder use a U-Net structure. The U-Net is built with residual layers and self-attention layers, similar to previous works [1, 2]. After the U-Net, a GS Adapter is connected, as described in lines 134-139.
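For concreteness, a minimal sketch of what such a per-pixel "GS adapter" head could look like; the layer sizes, attribute split, and activations below are illustrative assumptions, not the authors' implementation (the paper additionally predicts flow and semantic channels, omitted here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GSAdapter(nn.Module):
    """Hypothetical head mapping per-pixel U-Net features to Gaussian attributes."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # depth (1) + scale (3) + rotation quaternion (4) + opacity (1) + color (3)
        self.head = nn.Conv2d(feat_dim, 1 + 3 + 4 + 1 + 3, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> dict:
        # feats: (B, C, H, W) decoder features
        out = self.head(feats)
        depth, scale, rot, opacity, color = torch.split(out, [1, 3, 4, 1, 3], dim=1)
        return {
            "depth": depth.sigmoid(),            # normalized depth, later mapped to meters
            "scale": scale.exp(),                # positive Gaussian scales
            "rotation": F.normalize(rot, dim=1), # unit quaternion per pixel
            "opacity": opacity.sigmoid(),
            "color": color.sigmoid(),
        }
```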

W2: The details about the PD-Block. The details of the PD-Block are in the supplementary materials. We conducted ablation experiments on the three most important parameters and found that our algorithm is robust to them.

| Parameters | PSNR | SSIM | RMSE |
|---|---|---|---|
| base | 23.70 | 0.590 | 8.92 |
| τ (0.1) | 24.10 | 0.620 | 8.50 |
| τ (0.3) | 25.22 | 0.643 | 7.62 |
| τ (0.5) | 25.50 | 0.640 | 7.80 |
| τ (0.7) | 25.44 | 0.642 | 7.77 |
| τ (0.9) | 24.85 | 0.610 | 8.35 |
| K=3, Z=9 | 25.00 | 0.662 | 7.03 |
| K=5, Z=25 | 25.80 | 0.645 | 7.50 |
| K=7, Z=49 | 25.40 | 0.630 | 7.95 |

W3: The details about the dynamic objects. The way of selecting dynamic objects is described in the supplementary materials. We filter dynamic objects by the speed of their bounding boxes, and then use the boxes as prompts for SAM to segment the dynamic objects.
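A minimal sketch of this pipeline, assuming the segment-anything package and a speed threshold chosen for illustration (the threshold value and checkpoint path are not from the paper):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def segment_dynamic_objects(image, boxes, speeds, speed_thresh=1.0, ckpt="sam_vit_h.pth"):
    """Keep boxes whose speed exceeds a threshold and prompt SAM with them.

    image:  (H, W, 3) uint8 RGB frame
    boxes:  (N, 4) array of [x0, y0, x1, y1] projected 2D boxes
    speeds: (N,) array of object speeds in m/s (from annotations or tracking)
    """
    sam = sam_model_registry["vit_h"](checkpoint=ckpt)
    predictor = SamPredictor(sam)
    predictor.set_image(image)

    masks = []
    for box, speed in zip(boxes, speeds):
        if speed < speed_thresh:      # treat slow objects as static and skip them
            continue
        m, _, _ = predictor.predict(box=np.asarray(box), multimask_output=False)
        masks.append(m[0])            # (H, W) boolean mask for one dynamic object
    return masks
```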

[1] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS 33, 6840–6851 (2020)

[2] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS 35, 36479–36494 (2022)

Comment

Dear Reviewer,

I would like to express my sincere gratitude for the time and effort you have dedicated to reviewing my work.

I understand that your primary concerns relate to the model, hyperparameter and training details. I believe that my previous response has effectively addressed these issues. Could I kindly inquire if you are contemplating an improvement to the score based on the information provided?

Thank you once again for your consideration.

Comment

Thank you for this rebuttal. I have read it and I acknowledge that you have addressed most of my concerns. I will take your responses into account when making my final decision on your paper.

Comment

Here are the replies to the reviewer's questions:

Q1: Low resolution. We followed STORM [1] and chose a resolution of 160 × 240 for our experiments. According to our initial tests, our method supports resolutions up to 640 × 960, which is sufficient for real scenarios.

Q2: Loss weight hyperparameters. For each of the other loss terms, we conducted ablation experiments to determine their balancing weights. The experimental results show that varying each weight has a relatively stable impact on PSNR and SSIM, verifying the robustness of the model to the weight settings.

| Ablated Param | Value | PSNR | SSIM |
|---|---|---|---|
| λ_re | 0.5 | 27.38 | 0.82 |
|      | 1.0 | 27.52 | 0.83 |
|      | 1.5 | 27.40 | 0.82 |
| λ_c  | 0.05 | 27.36 | 0.82 |
|      | 0.1 | 27.52 | 0.83 |
|      | 0.2 | 27.47 | 0.82 |
| λ_r  | 0.05 | 27.31 | 0.81 |
|      | 0.1 | 27.52 | 0.83 |
|      | 0.2 | 27.42 | 0.82 |
| λ_dr | 0.05 | 27.29 | 0.81 |
|      | 0.1 | 27.52 | 0.83 |
|      | 0.2 | 27.45 | 0.82 |
| λ_sr | 0.05 | 27.34 | 0.82 |
|      | 0.1 | 27.52 | 0.83 |
|      | 0.2 | 27.46 | 0.82 |
| λ_seg | 0.05 | 27.28 | 0.81 |
|       | 0.1 | 27.52 | 0.83 |
|       | 0.2 | 27.39 | 0.82 |

[1] J. Yang, J. Huang, Y. Chen, Y. Wang, B. Li, Y. You, A. Sharma, M. Igl, P. Karkus, D. Xu, et al. Storm: Spatio-temporal reconstruction model for large-scale outdoor scenes. Proc. ICLR, 2024

Review (Rating: 4)

This paper introduces a feedforward model that infers 3D Gaussians of driving scenes from multi-view images. It employs PD-Block and dynamic-static decoupling for a reasonable representation of the scene. The paper also explores downstream tasks such as rendering with changed camera poses and intrinsics, pre-training for end-to-end driving, and scene editing. Furthermore, the model achieves state-of-the-art performance when compared to other feedforward reconstruction models for autonomous driving scenarios.

Strengths and Weaknesses

Strengths

  1. The PD-Block introduces a novel mechanism for managing Gaussian points by learning to both prune and dilate them, thereby enabling a more flexible, non-pixel-aligned feed-forward Gaussian representation.
  2. The method outperforms existing feedforward driving reconstruction approaches. This article also uses its image encoder to explore pre-training for end-to-end autonomous driving.
  3. A static-dynamic separation strategy for unsupervised learning of autonomous driving scene reconstruction.
  4. Extensive experiments, including reconstruction speed, rendering image quality, and rendering depth comparison.

Weaknesses

  1. The images in Figure 5 and Figure 6 are of low rendering quality. Additionally, the presentation of novel views feels insufficiently clear. For reference, works like FreeVS, StreetCrafter, and ReconDreamer [1,2,3] utilize specific ego-vehicle lane change distances (in meters) for visualization.
  2. Regarding scene editing tasks, is it possible to elevate the scope to include modifications of vehicles within the scene like Street Gaussians[4]? The current article solely showcases the capability of adding novel objects (billboards).
  3. In Table 1, why is the reconstruction quality of per-scene optimization methods so low? Is it because the training iterations were limited? For PVG, if it didn't use LiDAR information, did it utilize SfM to extract the initial point cloud for the scene?

Reference

[1]Wang, Qitai, et al. "Freevs: Generative view synthesis on free driving trajectory." arXiv preprint arXiv:2410.18079 (2024).

[2]Ni, Chaojun, et al. "Recondreamer: Crafting world models for driving scene reconstruction via online restoration." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.

[3]Yan, Yunzhi, et al. "Streetcrafter: Street view synthesis with controllable video diffusion models." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.

[4]Yan, Yunzhi, et al. "Street gaussians: Modeling dynamic urban scenes with gaussian splatting." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.

Questions

What are the depth categories in line 135?

What do L_dr and L_sr in line 171 stand for?

Please refer to other questions in weaknesses.

Limitations

The reconstruction quality of per-scene optimization and the rendering quality of images can be improved, and the differences from existing SoTA methods should be clarified.

Final Justification

After reading the rebuttal, I would like to keep my score.

Formatting Issues

None

Author Response

Thank you for your valuable time and comments. Because the official policy prohibits including any links, we are unable to share updated visualizations here; we hope you can understand. We will update the visualizations in the final version. We believe the following reply will alleviate your concerns:

W1: Better-quality visualization. Thank you for your suggestion. Following STORM, we keep an output resolution of 160 × 240, so the quality looks relatively low. Our algorithm supports higher resolutions, and further experiments show that with higher-resolution supervision our results improve. We will include better novel-view visualizations (in the style of FreeVS, StreetCrafter, and ReconDreamer) in the final version.

W2: Scene editing tasks. Our model supports dynamic-static separation, so we can add any object to the static GS background. We can also control objects to move and rotate. We will include such visualizations in the final version.

W3: The reason for the low quality of per-scene optimization methods. Following STORM, for a fair comparison with the feed-forward methods, there are the following changes: (1) we did not use point cloud initialization; (2) we used fewer views for supervision (4 timesteps × 3 views). These settings degrade the per-scene optimization results.

Q1: The details of the formula. "Depth categories" means that we use a classification head to predict the depth value: we use 256 categories to represent depths from 0 to 256 meters. L_dr and L_sr denote the rendering supervision of dynamic objects and static objects, respectively. L_dr and L_dy are the same thing; we will make corrections in the final version.
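A minimal sketch of such classification-based depth prediction; the exact bin layout and the soft-argmax readout below are illustrative assumptions, not the authors' implementation:

```python
import torch

def depth_from_logits(logits: torch.Tensor, max_depth: float = 256.0) -> torch.Tensor:
    """Convert per-pixel depth-classification logits to metric depth.

    logits: (B, 256, H, W), one channel per depth bin covering 0-256 m.
    Returns the expected depth under the softmax distribution (soft-argmax).
    """
    num_bins = logits.shape[1]
    bin_centers = torch.linspace(0.0, max_depth, num_bins, device=logits.device)
    probs = logits.softmax(dim=1)                           # (B, 256, H, W)
    depth = (probs * bin_centers.view(1, -1, 1, 1)).sum(1)  # (B, H, W) in meters
    return depth
```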

We apologize once again for being unable to update the visualizations. We will provide higher-resolution results, more specific novel views (+1 m, +2 m, +3 m), and other visualizations in the final version.

Comment

Thank you for the detailed reply and I will keep my score of 4 considering the author rebuttal.

Review (Rating: 5)

This paper presents DrivingRecon, a feed-forward model for 4D scene reconstruction in autonomous driving. By using a shared 2D encoder and range-view decoder, the model efficiently fuses multi-view images to generate 4D Gaussian representations. The core innovations are the PD-Block, which optimizes Gaussian point distribution by pruning overlaps and dilating complex regions, and dynamic-static decoupling, which enables cross-temporal supervision for better motion modeling. Extensive evaluations on Waymo and nuScenes datasets demonstrate superior performance in reconstruction quality and inference speed, outperforming both per-scene optimization and generalizable feed-forward methods.

Strengths and Weaknesses

Strengths

  1. Originality of Approach: The PD-Block addresses redundancy in multi-view Gaussian fusion and dynamically allocates computational resources, a novel solution to overlapping artifacts and fixed-point limitations in prior work.

  2. Comprehensive Evaluation: The paper compares DrivingRecon against a wide range of baselines (per-scene optimization and feed-forward models) and provides ablation studies for PD-Block and dynamic-static decoupling, validating component effectiveness.

  3. Efficiency: DrivingRecon achieves faster inference (0.08s) with fewer parameters (53.37M) than SOTA methods like STORM (100.60M), enabled by shared encoders and range-view decoders.

Weaknesses

  1. Inadequate validation for fast-moving or small dynamic objects: while dynamic-static decoupling is proposed, the model's reconstruction quality for fast-moving objects (e.g., motorcycles, pedestrians) or small entities (e.g., bicycles) is not explicitly tested. The optical flow supervision (L_dy) may struggle with high-velocity motions or occlusions.

  2. Adaptive limitations of the PD-Block: the PD-Block relies on manually set thresholds (e.g., τ for redundancy pruning) and empirical parameters (e.g., K centers and Z nearest points), lacking dynamic adaptation to varying scene complexities. For instance, fixed thresholds may incorrectly prune critical Gaussian points in dense traffic or retain redundant points on complex object edges, degrading reconstruction fidelity. The nearest-point aggregation strategy struggles to generalize across scene densities (e.g., urban centers vs. suburban roads), potentially causing feature under-fitting in sparse areas or computational redundancy in dense regions.
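To make the concern above concrete, here is a toy sketch of how a threshold-based prune-and-dilate step might look; the redundancy scoring and the way τ, K, and Z are used here are illustrative assumptions, not the paper's actual implementation:

```python
import torch

def prune_and_dilate(positions, features, scores, tau=0.5, K=5, Z=25):
    """Toy prune-and-dilate over a set of Gaussian candidates.

    positions: (N, 3) Gaussian centers merged from multiple views
    features:  (N, C) per-Gaussian features
    scores:    (N,)  predicted importance score in [0, 1]
    tau:       pruning threshold; Gaussians with score < tau are dropped
    K, Z:      for each of K "complex region" centers, aggregate its Z nearest
               surviving Gaussians and spawn an extra (dilated) Gaussian there.
    """
    keep = scores >= tau                          # prune redundant/overlapping points
    pos, feat = positions[keep], features[keep]

    # pick K centers with the highest scores as complex regions
    center_idx = scores[keep].topk(min(K, pos.shape[0])).indices
    new_pos, new_feat = [], []
    for c in center_idx:
        d = (pos - pos[c]).norm(dim=1)
        nn_idx = d.topk(min(Z, pos.shape[0]), largest=False).indices
        new_pos.append(pos[nn_idx].mean(0))       # dilate: add an aggregated Gaussian
        new_feat.append(feat[nn_idx].mean(0))

    if new_pos:
        pos = torch.cat([pos, torch.stack(new_pos)], dim=0)
        feat = torch.cat([feat, torch.stack(new_feat)], dim=0)
    return pos, feat
```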

Questions

  1. Dynamic Object Resolution: How does DrivingRecon handle small, fast-moving objects (e.g., bicycles) without explicit motion prediction modules? Are there plans to integrate motion-aware Gaussian updates?
  2. Computational Scalability: For large urban scenes with thousands of dynamic objects, does the PD-Block maintain efficiency? Could spatial partitioning (e.g., octrees) further optimize inference?
  3. Temporal Consistency in Long Sequences: For extended driving sequences (e.g., 10+ minutes), how does the model maintain temporal consistency without cumulative errors in Gaussian point trajectories?

Limitations

Does this paper have an Appendix?

Formatting Issues

No major formatting issues detected.

Author Response

Thank you for your valuable time and comments. We believe the following reply will alleviate your concerns:

W1: Validation for fast-moving or small dynamic objects. We have supplemented more ablation experiments for different speeds and for small dynamic objects (people and two-wheelers in driving scenarios are considered small dynamic objects). The experimental setting is the same as in Table 1 of the original paper. We use box labels (with speed and object category) as prompts for SAM to generate the segmentation.

| Category | DriveRecon (PSNR↑ / SSIM↑ / LPIPS↓) | DriveRecon w/o DS (PSNR↑ / SSIM↑ / LPIPS↓) | DriveRecon w/o PD-Block (PSNR↑ / SSIM↑ / LPIPS↓) |
|---|---|---|---|
| Speed > 10 m/s | 24.75 / 0.78 / 0.24 | 23.35 / 0.70 / 0.30 | 22.85 / 0.68 / 0.32 |
| Speed < 10 m/s | 26.53 / 0.84 / 0.16 | 25.63 / 0.82 / 0.19 | 25.13 / 0.81 / 0.21 |
| Person | 26.51 / 0.79 / 0.20 | 25.62 / 0.77 / 0.23 | 25.15 / 0.76 / 0.25 |
| Two-wheeler | 26.14 / 0.78 / 0.21 | 25.21 / 0.76 / 0.24 | 24.77 / 0.75 / 0.26 |
| Others | 27.24 / 0.82 / 0.17 | 26.33 / 0.80 / 0.20 | 25.86 / 0.79 / 0.22 |

W2: Adaptive limitations of the PD-Block. We conducted ablation experiments on the three most important parameters and found that our algorithm is robust to them. Although the receptive field and threshold may cause some variation, there is still a significant improvement over the baseline, which indicates that our pruning and dilation mechanisms are highly effective.

| Parameters | PSNR | SSIM | RMSE |
|---|---|---|---|
| base | 23.70 | 0.590 | 8.92 |
| τ (0.1) | 24.10 | 0.620 | 8.50 |
| τ (0.3) | 25.22 | 0.643 | 7.62 |
| τ (0.5) | 25.50 | 0.640 | 7.80 |
| τ (0.7) | 25.44 | 0.642 | 7.77 |
| τ (0.9) | 24.85 | 0.610 | 8.35 |
| K=3, Z=9 | 25.00 | 0.662 | 7.03 |
| K=5, Z=25 | 25.80 | 0.645 | 7.50 |
| K=7, Z=49 | 25.40 | 0.630 | 7.95 |

Q1: Dynamic object resolution. Our model additionally predicts optical flow parameters to model the movement of objects. Therefore, our algorithm models dynamic objects better than existing methods, as shown in Table 1. If motion-aware Gaussian updates can further improve small objects, we will consider integrating them. Thank you for your suggestion.

Q2: Computational scalability. We ablated the inference speed at different resolutions and found that the additional time introduced by our PD-Block is very small and can be ignored.

| Resolution | Time (s) | Time without PD-Block (s) |
|---|---|---|
| 160 × 240 | 0.08 | 0.07 |
| 320 × 480 | 0.17 | 0.16 |

Q3: Temporal consistency in long sequences. We report the geometric evaluation in Table 2, which indicates that our geometric consistency is also the best relative to existing work.

Q4: Does this paper have an appendix? Yes, we have an appendix; please check it.

Final Decision

The paper presents DrivingRecon, a feed-forward model for 4D scene reconstruction in autonomous driving. By using a shared 2D encoder and a range-view decoder, the model efficiently fuses multi-view images to generate 4D Gaussian representations. The core innovations are 1) the Prune and Dilate Block (PD-Block), which prunes redundant and overlapping Gaussian points and dilates Gaussian points for complex objects, and 2) dynamic-static decoupling to better learn temporally consistent geometry across time.

It was reviewed by four reviewers from the community. Initially, it received mixed ratings: 1× Accept, 2× Borderline Accept, and 1× Reject. The major weaknesses lie in inadequate validation for fast-moving or small dynamic objects, adaptive limitations of the PD-Block, computational scalability, temporal consistency in long sequences, scene editing tasks, and missing details. The authors submitted a rebuttal to address the reviewers' concerns, which, as the reviewers confirmed, resolved most of the comments. Reviewer WHfH did not respond during the discussion period.

Given the consensus, I would like to accept the paper. The authors are requested to revise the paper accordingly.