PaperHub
Score: 8.2/10 · Spotlight · 4 reviewers
Ratings: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 3.8
Novelty: 2.5 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.8
NeurIPS 2025

GaussianFusion: Gaussian-Based Multi-Sensor Fusion for End-to-End Autonomous Driving

OpenReview · PDF
Submitted: 2025-05-06 · Updated: 2025-10-29

Abstract

Keywords
End-to-End Autonomous Driving, Multi-Sensor Fusion, Gaussian

Reviews and Discussion

Review
Rating: 5

Current end-to-end driving methods handle perception and sensor fusion by either geometrically projecting encoded sensor features into a BEV representation or implicitly mapping features to a 2D grid via global attention. This paper instead proposes to use a point based 2D Gaussian intermediate representation upon which all features are mapped. The paper then proposes a refinement pipeline to convert the Gaussian features to planning outputs. The method is evaluated on Bench2Drive and NavSim.

Strengths and Weaknesses

The paper investigates a new representation for sensor fusion: 2D Gaussians in BEV. This is interesting in particular because Gaussians have been so successful in NVS / 3D reconstruction. The paper follows the standard recipe of proposing a method, comparing its performance against baselines, and ablating components. I particularly appreciate the choice of using a combination of NavSim and Bench2Drive to assess the performance of the method. This measures performance with realistic sensor data and a good metric (NavSim) as well as closed-loop performance (Bench2Drive), both for short driving periods.

L. 82: “significantly surpassing current state-of-the-art methods.” The state-of-the-art (of prior work) on NavSim is Hydra-MDP-C with 91.0 PDMS. Since this method achieves 88.9 PDMS it does not surpass the state of the art, so the claim is incorrect. However, this paper uses a lightweight vision backbone, so I recommend making some claim around high performance with low inference time instead, or making a claim around surpassing the state-of-the-art in methods that use ResNet34 as vision backbone.

[1] Could additionally be cited when discussing BEV fusion methods for E2E driving.

Table 1 does not make it clear that only baselines using the ResNet34 backbone are shown, and better methods like Hydra-MDP++ with the V2-99 backbone are omitted.

The EPDMS is a metric from the NavSim v2 benchmark that uses 2 stages. This paper uses this metric with the NavSim v1 benchmark and seems to alter the metric by using only 1 stage. It is not clear to me why the EPDMS metric is used with NavSim v1; it does not seem to provide additional insights compared to the standard PDMS metric. I recommend the authors change Table 1 to use the standard PDMS metric (which is in Appendix Table 5), for which there are also more baselines available. I also recommend that more baselines be added to the table and that the authors include the SOTA, like Hydra-MDP-C and Hydra-MDP++ (V2-99). The methods could, for example, be grouped by the image backbone being used to make it clear that some of the better methods use larger image backbones (and compare GaussianFusion within the ResNet34 group).

Training end-to-end driving methods can exhibit quite some variance. Both Tables 1 and 2 would benefit from evaluating multiple (3 is typical) training seeds and reporting the mean and standard deviation. NavSim is deterministic, but Bench2Drive is not, so multiple evaluation seeds (3) would also be better here.

L. 307: The paper claims here that the method does not significantly increase model complexity because the number of parameters is not increased. This argument would be more convincing if, additionally, the number of FLOPs used or inference time were also reported, as they might increase with the iterative processing of Gaussians. Table 4 suggests that compute requirements do increase. The paper claims that this is due to worse CUDA kernel optimization. This argument would be more convincing if the number of FLOPs were reported.

The examples shown in Figure 4 do not seem particularly interesting. The first example is at the end of a left turn, where the car just has to follow the lane. The second and third examples seem to be basic straight driving. Figure 4 could be moved to the appendix if more space is needed in the main paper.

The two-sentence limitations discussion in the conclusion seems too brief. There are probably more limitations to this study. These could perhaps be expanded in the appendix if space is a concern.

Overall, I appreciate the study of using 2D Gaussians for sensor fusion. I am concerned that the experimental section focuses more on selling the method than educating the reader about its advantages and disadvantages. The appropriate state-of-the-art methods should be acknowledged in the table comparisons. If the paper wants to claim state-of-the-art performance, it should train GaussianFusion with better vision backbones and show that it can actually achieve state-of-the-art numbers. Otherwise, the claims should be toned down and made specific to the ResNet34 backbone. I would consider raising my score if this is addressed.

References:

[1] Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zak Murez, Corina Gurau, Hudson Yeo, Alex Kendall, Roberto Cipolla, Jamie Shotton: “Model-Based Imitation Learning for Urban Driving”, NeurIPS 2022

Questions

L. 269: Cropping the camera image to only 448 × 250 (is it HxW or WxH?) seems to be too small to me. For comparison, TF++ uses (384x1024 HxW). Are multiple cameras being used here? Or do the authors have any intuition why such a small camera was sufficient for Bench2Drive?

L. 311: How are the actions of the car computed if the model prediction head is removed? Is the whole model architecture trained in one stage, or is the vision backbone trained separately with semantic mapping?

The 77.2 DS result of TF++ in Table 2 is marked as using the same backbone and training dataset as GaussianFusion. TF++ reports 84.2 DS performance in the draft using a regnety_032 backbone. The training dataset used in this work seems to be the same dataset as used in TF++, except that only the town 12 subset of the data was used. So is the lower performance of TF++ a result of using less data and a smaller backbone? The description of Table 2 should explicitly state that the other baselines besides TF++ were trained using a different dataset (Bench2Drive) than GaussianFusion and TF++. This is fine but should be said explicitly.

Limitations

yes

Final Justification

My concerns have been addressed in the rebuttal. I update my rating to accept.

Formatting Issues

no

Author Response

Q1. Additional experiments with the V2-99 vision backbone.

Thank you for your helpful suggestion. To explore the scalability of our framework, we extended GaussianFusion by applying a more powerful image backbone (V2-99) and increasing the Gaussian feature dimension from 128 to 256 to match V2-99. The results are shown in Table 1 below.

As we can see, our enhanced version (GaussianFusion*), equipped with V2-99, further improves performance across all metrics—achieving state-of-the-art results on the Navtest benchmark. This demonstrates the scalability of our framework with stronger backbones.

We note that, due to the tight rebuttal timeline, we were only able to perform a single-seed training run for this extended version. Nevertheless, the results already show clear performance gains. These results will be added to Table 1 of the paper for a comprehensive comparison.

Table 1. Performance on the Navtest Benchmark. ‘*’ denotes an extended version of our method.

| Method | Input | Img. Backbone | NC ↑ | DAC ↑ | TTC ↑ | EP ↑ | PDMS ↑ |
|---|---|---|---|---|---|---|---|
| LTF | Camera | ResNet34 | 97.4 | 92.8 | 92.4 | 79.0 | 83.8 |
| TransFuser | C & L | ResNet34 | 97.7 | 92.8 | 93.0 | 79.2 | 84.0 |
| GoalFlow | C & L | ResNet34 | 98.3 | 93.3 | 94.8 | 79.8 | 85.7 |
| Hydra-MDP | C & L | ResNet34 | 98.3 | 96.0 | 94.6 | 78.7 | 86.5 |
| Hydra-MDP++ | C | ResNet34 | 97.6 | 96.0 | 93.1 | 80.4 | 86.6 |
| ARTEMIS | C & L | ResNet34 | 98.3 | 95.1 | 94.3 | 81.4 | 87.0 |
| DiffusionDrive | C & L | ResNet34 | 98.2 | 96.2 | 94.7 | 82.2 | 88.1 |
| GaussianFusion (Ours) | C & L | ResNet34 | 98.3 | 97.2 | 94.6 | 83.0 | 88.8 |
| GoalFlow | C & L | V2-99 | 98.4 | 98.3 | 94.6 | 85.0 | 90.3 |
| Hydra-MDP-C | C | V2-99 | 98.7 | 98.2 | 95.0 | 86.5 | 91.0 |
| Hydra-MDP++ | C | V2-99 | 98.6 | 98.6 | 95.1 | 85.7 | 91.0 |
| GaussianFusion* (Ours) | C & L | V2-99 | 98.7 | 98.1 | 95.7 | 88.2 | 92.0 |

Q2. Why is the EPDMS metric used with NavSim v1?

Thank you for the question. We adopt EPDMS because it is a more challenging and comprehensive evaluation metric than PDMS. Originally proposed by Hydra-MDP++ [1], EPDMS extends the standard evaluation protocol by introducing several additional criteria:

  • Lane Keeping (LK)
  • Extended Comfort (EC)
  • Driving Direction Compliance (DDC)
  • Traffic Light Compliance (TLC)
  • False-Positive Penalty Filtering

These components enable a more holistic assessment of driving behavior. Importantly, while EPDMS is employed by NavSim v2, it is also compatible with NavSim v1 for single-stage evaluation.

We agree with the reviewer that only a limited number of existing methods have been evaluated under EPDMS, which may affect direct comparability. Therefore, following your suggestion, we will move the EPDMS results to the Appendix in the final version.

Q3. Training end-to-end driving methods can exhibit quite some variance. Both Tables 1 and 2 would benefit from evaluating multiple (3 is typical) training seeds and reporting the mean and standard deviation.

Thanks for your helpful suggestion. We have trained models under 3 training seeds on both the NAVSIM and Bench2Drive benchmarks. As the evaluation on the CARLA-based Bench2Drive benchmark is very time-consuming, each trained model is evaluated only once. Due to the character limit, please see the results in Tables 1 and 2 of the response to Reviewer v3h9.

On the NAVSIM benchmark, our method exhibits stable performance with minimal variance, reflecting its robustness to random initialization and consistent behavior across multiple runs.

For the closed-loop Bench2Drive benchmark, while slightly higher variance is observed, as expected due to the complexity and stochastic nature of interactive scenarios, our method consistently surpasses other baselines in average performance. These results will be included in the final version of the paper.

Q4. Reporting FLOPs to better show the computational efficiency.

According to your suggestion, we have calculated the FLOPs of different fusion methods. The results are shown in Table 2 below.

Table 2. Comparison of different fusion methods on accuracy and efficiency.

| Fusion Method | Param. | mIoU↑ | EPDMS↑ | FLOPs (G) | Latency (ms)↓ |
|---|---|---|---|---|---|
| Flatten | 55.7M | 0.50 | 81.8 | 27.4 | 18 |
| BEV | 51.9M | 0.51 | 83.8 | 58.4 | 43 |
| Gaussian | 51.6M | 0.55 | 84.5 | 35.7 | 32 |

Compared with the BEV fusion method, our Gaussian fusion achieves better performance (+0.04 mIoU and +0.7 EPDMS), while also significantly reducing computational cost: the FLOPs are 38.9% lower (58.4G → 35.7G) and latency is reduced by 25.6% (43ms → 32ms), showing clear advantages in both accuracy and efficiency.

Compared with the Flatten (TransFuser-style) fusion, which is a very lightweight approach, our method exhibits only a 30.3% increase in FLOPs (27.4G → 35.7G), while achieving much stronger performance (+0.05 mIoU and +2.7 EPDMS). However, our method's latency is 77.8% higher (18ms → 32ms), which we believe is largely due to the differences in operator-level optimization. This could be further mitigated through engineering improvements in deployment.
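For context, a minimal sketch of how FLOPs and parameter counts of a fusion module can be measured with fvcore is given below. The ToyFusion module and its input shapes are illustrative placeholders, not the actual GaussianFusion architecture or its real inputs.

```python
# Hedged sketch: FLOP / parameter counting for a two-branch fusion module.
# ToyFusion is a stand-in model; it is NOT the paper's architecture.
import torch
import torch.nn as nn
from fvcore.nn import FlopCountAnalysis, parameter_count


class ToyFusion(nn.Module):
    """Stand-in camera + LiDAR fusion block used only to illustrate the measurement."""

    def __init__(self):
        super().__init__()
        self.cam = nn.Conv2d(3, 64, 3, stride=2, padding=1)    # camera branch
        self.lid = nn.Conv2d(64, 64, 3, stride=2, padding=1)   # LiDAR-feature branch
        self.head = nn.Conv2d(128, 64, 1)                      # fused output head

    def forward(self, camera, lidar):
        c = self.cam(camera)                         # (1, 64, 128, 128)
        l = self.lid(lidar)                          # (1, 64, 128, 128)
        return self.head(torch.cat([c, l], dim=1))  # fused BEV-like feature map


model = ToyFusion().eval()
camera = torch.randn(1, 3, 256, 256)   # assumed image input shape
lidar = torch.randn(1, 64, 256, 256)   # assumed LiDAR BEV feature shape

with torch.no_grad():
    flops = FlopCountAnalysis(model, (camera, lidar))
    print(f"GFLOPs: {flops.total() / 1e9:.2f}")
    print(f"Params (M): {parameter_count(model)[''] / 1e6:.2f}")
```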

Q5. The examples shown in Figure 4 do not seem particularly interesting. The first example is at the end of a left turn, where the car just has to follow the lane. The second and third examples seem to be basic straight driving. Figure 4 could be moved to the appendix if more space is needed in the main paper.

Thanks for your suggestion. We will move Figure 4 to the appendix and include additional visualizations of more complex scenarios and failure cases in the final version of the paper.

Q6. The 2 sentence limitations in the conclusion seem to be too brief. There are probably more limitations to this study. These can perhaps be expanded in the appendix if space is a concern.

Thank you for pointing this out. We agree that the limitations section in the conclusion is overly concise and does not sufficiently reflect the boundaries of our study. In the final version, we will include a dedicated Limitations section in the appendix to provide a more comprehensive discussion.

Based on the reviewer’s suggestion and further reflection, one important limitation is that we have not evaluated the robustness of our fusion method under sensor occlusion or noise, which is a critical aspect for real-world deployment.

Another limitation is that our current design relies on 3D bounding boxes and road topology annotations to optimize 2D Gaussians. While effective, such supervision can be expensive or unavailable in real-world deployment scenarios.

We appreciate the reviewer’s input in helping us better contextualize the scope of our work.

Q7. Citing a BEV method MILE [1] for E2E driving.

Thank you for pointing this out. MILE [1] is a pioneering work in BEV-based end-to-end driving. It lifts image features into 3D space by leveraging a predicted depth probability distribution and then aggregates the resulting voxel features into the BEV space via a predefined grid-based sum-pooling strategy. We appreciate the reviewer’s suggestion and will include a citation and discussion of MILE in the related work section of the final version.

Q8. About the image resolution setting.

Thank you for pointing this out. We acknowledge that the resolution setting was not clearly described in the manuscript. The resolution of 448 × 250 (W×H) is used on the NAVSIM benchmark, where multiple camera views are concatenated. For the Bench2Drive benchmark, we adopt only the front-view camera, which already provides a wide field of view. In this case, we use the resolution of 1024 × 384 (W×H), which is consistent with the setting used in TransFuser++.

We will clarify this resolution setting and correct the description in the final version of the paper.

Q9. About training of the GaussianFusion framework.

In our framework, the perception head (used for map construction) and the planning head (used for trajectory prediction) are designed to be independent branches. The perception head does not contribute directly to the driving actions; rather, it serves as an auxiliary supervision signal to optimize 2D Gaussians and build geometry inductive biases.

As for the training strategy, our model is trained in a single-stage end-to-end manner, without any separate pretraining of the perception head.

Q10. About the training settings on the Bench2Drive benchmark.

We believe the lower performance of TF++ in our experiments can indeed be attributed to two factors: smaller model capacity and reduced training data.

Specifically, the original TF++ paper utilizes RegNetY-32 as the backbone and is trained on a dataset that is roughly three times larger than what we used in our experiments. In contrast, our reproduced version of TF++ adopts a lighter-weight backbone (for fair comparison) and is trained from scratch with a smaller training split (due to the computing power limitation).

Following your suggestion, we will explicitly state the differences in the training datasets in the caption of Table 2 of the paper.

References

[1] Hu A, Corrado G, Griffiths N, et al. “Model-Based Imitation Learning for Urban Driving”. NeurIPS, 2022.

Comment

Dear Reviewer,

There is an author-reviewer discussion stage before Aug. 6th. Please take a look at the authors' rebuttal as soon as possible, and leave a message to the authors in case you still require additional clarity, or if you want to post points of disagreement for discussion with the authors.

AC

Comment

My concerns have been addressed in the rebuttal. I update my rating to accept.

Comment

We sincerely thank you for your positive recommendation and insightful comments. We’re glad that our responses have clarified your concerns. Your feedback is invaluable and will motivate us in further improving GaussianFusion and advancing our future research efforts.

Review
Rating: 5

This paper proposes GaussianFusion, a novel multi-sensor fusion framework for end-to-end autonomous driving. Unlike traditional attention-based flatten fusion or BEV fusion methods, GaussianFusion uses 2D Gaussian representations as intermediate carriers to efficiently and interpretably integrate features from multiple sensors. The architecture includes a dual-branch fusion pipeline to capture local multi-sensor features and to aggregate global planning cues, respectively. Evaluations on NAVSIM and Bench2Drive benchmarks show that GaussianFusion achieves state-of-the-art performance with improved planning accuracy and efficiency, demonstrating the promise of Gaussian-based representations for autonomous driving systems.

Strengths and Weaknesses

Strengths

  1. The overall structure of the paper is clear and easy to understand.

  2. The paper presents an efficient Gaussian-based multi-sensor fusion framework, and experiment results on both open-loop and closed-loop benchmarks validate its effectiveness.

  3. The paper provides extensive ablation studies to validate and analyze the proposed modules.

Weaknesses

  1. Although the authors state the differences compared to GaussianAD, designs such as multi-sensor fusion and 2D Gaussians do not demonstrate clear advantages over GaussianAD. Besides, the authors do not seem to demonstrate the necessity of using Gaussian representations. The proposed designs are not tailored to Gaussian representations and could easily be integrated with other structures, such as BEV-based methods. The ablation in Tab. 4 also shows that there does not appear to be a significant gap between Gaussian-based and BEV-based representations.

  2. The authors could provide some visualization results with more complex motion modes and some failure cases.

Questions

please see weaknesses

Limitations

yes

Final Justification

Thanks for the authors' detailed responses. The advantages of Gaussian Fusion over previous methods are already clear, and my concerns have been addressed.

Formatting Issues

none

Author Response

Q1. Concerns about the necessity and advantage of 2D Gaussian representations over existing methods.

Thank you for the thoughtful comments. Our proposed 2D Gaussian framework offers several key advantages over existing methods.

First, compared to GaussianAD [1], our method significantly reduces the number of Gaussians required to model a driving scene by compressing 3D Gaussians into 2D representations (25600 Gaussians in GaussianAD vs. 512 in ours). This leads to a substantially lower computational burden. Moreover, GaussianAD requires expensive 3D occupancy labels for optimization, while our method only uses 2D semantic maps—projected from road topology and bounding boxes onto the ground plane. These labels are widely available in existing driving datasets and can be more easily generated through automatic labeling pipelines.

Second, compared to BEV-based fusion methods [2], our 2D Gaussian framework improves both efficiency and accuracy. Specifically, our method reduces inference latency by 25.6%, while achieving better performance in perception and planning, as shown in Table 3 of the paper. In a typical BEV setup covering a range of [-32m, 32m] on both the x and y axes, a voxel size of 0.5m results in a [128, 128] query map—i.e., 16,384 BEV queries. In contrast, our approach uses only 512 deformable and position-adaptive 2D Gaussians, which are far more compact and flexible in representing the scene compared to dense BEV grids.
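For concreteness, the query-count comparison above works out as follows, using the 0.5 m voxel size and the [-32 m, 32 m] range quoted in the response:

```latex
\frac{32\,\text{m} - (-32\,\text{m})}{0.5\,\text{m}} = 128 \text{ cells per axis}
\quad\Rightarrow\quad
128 \times 128 = 16{,}384 \text{ BEV queries}
\;\approx\; 32 \times 512 \text{ Gaussians}
```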

Q2. Authors can provide some visualization results with more complex motion modes and some failure cases.

Thanks for your suggestion. Due to the limitations of the current rebuttal submission, we are unable to include visualization results here. We will include the relevant visualizations—covering both complex motion scenarios and failure cases—in the appendix of the final version.

References

[1] Zheng W, Wu J, Zheng Y, et al. "Gaussianad: Gaussian-centric end-to-end autonomous driving". arXiv preprint arXiv:2412.10371, 2024.

[2] Li Z, Wang W, Li H, et al. "Bevformer: learning bird's-eye-view representation from lidar-camera via spatiotemporal transformers". IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

Comment

Thanks for the authors' detailed responses. The advantages of Gaussian Fusion over previous methods are already clear, and my concerns have been addressed.

Comment

We appreciate your thoughtful comments and constructive feedback. We are glad to hear that our responses have resolved your concerns. Given the improvements and clarifications provided, we hope you may consider updating your evaluation ranking score accordingly.

Review
Rating: 5

This paper introduces GaussianFusion, a novel Gaussian-based multi-sensor fusion framework for end-to-end autonomous driving. Existing methods often rely on flattened fusion or bird's-eye view (BEV) representations, which suffer from limited interpretability or high computational costs. The proposed framework innovatively uses 2D Gaussian distributions as an intermediate representation, initializing a set of Gaussians and iteratively refining them through multi-modal feature fusion. Explicit features capture scene semantics and spatial information, while implicit features are tailored for trajectory planning. The dual-branch fusion pipeline and cascade planning head further enhance scene understanding and trajectory optimization. Experiments on NAVSIM and Bench2Drive benchmarks demonstrate significant performance and robustness of GaussianFusion.

Strengths and Weaknesses

Strengths

  1. Original Architecture Design: First to introduce Gaussian representations to multi-sensor fusion for autonomous driving, leveraging the sparsity and physical interpretability (e.g., mean, scale, rotation) of 2D Gaussians to reduce computational complexity while improving scene representation transparency.
  2. Task-Specific Dual-Branch Mechanism: Decouples scene reconstruction (explicit branch) and planning guidance (implicit branch), enabling Gaussians to simultaneously model the environment and directly inform decision-making.
  3. Efficient Cascade Planning: Iteratively refines trajectories through hierarchical Gaussian queries, showing stronger stability in complex scenarios (e.g., lane-keeping, obstacle avoidance) with notable gains in sub-metrics like DAC and LK.
  4. Cross-Benchmark Robustness: Outperforms existing methods in both open-loop (NAVSIM) and closed-loop (Bench2Drive) settings, validating generalization across diverse driving environments.

Weaknesses

  1. Limited 3D Scene Adaptability: The 2D Gaussian assumption may not adequately handle complex 3D environments (e.g., ramps, overpasses), potentially restricting real-world applicability.
  2. Incomplete Computational Analysis: While claiming reduced overhead through sparsity, the paper lacks detailed comparisons with BEV methods in terms of computational efficiency, especially in real-time driving scenarios.
  3. Insufficient Ablation Studies: Fails to isolate the contributions of the dual-branch design and cascade planning head, making it difficult to quantify the impact of individual components.

Questions

  1. 3D Scene Extension: Can the current 2D Gaussian framework handle scenarios with vertical structures (e.g., viaducts, tunnels)? Are there plans to incorporate 3D Gaussians or hybrid representations?
  2. Computational Efficiency: Please provide inference time comparisons on embedded platforms (e.g., NVIDIA Jetson) to demonstrate real-world applicability.
  3. Ablation Study Gaps: Could you include ablation results for the dual-branch structure and cascade planning head to quantify their contributions to EPDMS/PDMS metrics?
  4. Dataset Limitations: Do NAVSIM and Bench2Drive cover extreme weather conditions? How robust is the framework under low visibility?

Limitations

The authors do not adequately discuss the limitations of 2D Gaussians in complex 3D environments or the impact of sensor synchronization on fusion accuracy. Suggestions for improvement:

  1. Clarify the theoretical boundaries of 2D Gaussian modeling, such as the effects of slopes and elevation changes on trajectory prediction.
  2. Analyze how time synchronization errors between LiDAR and cameras affect Gaussian feature aggregation.

Formatting Issues

No major formatting issues detected.

Author Response

Q1. Can the current 2D Gaussian framework handle scenarios with vertical structures (e.g., viaducts, tunnels)? Are there plans to incorporate 3D Gaussians or hybrid representations?

Our current 2D Gaussian framework primarily focuses on modeling foreground objects and road topology, which are critical for safe driving decisions. Since the output of the end-to-end planner is a 2D trajectory on the ground plane, we simplify the scene representation by assuming the ego vehicle operates on a relatively flat plane in its local coordinate system.

Although extending our framework to support full 3D Gaussians or hybrid representations is technically feasible, such extensions would incur prohibitive computational costs that would undermine the system's real-time performance. Therefore, we currently do not plan to extend our framework with 3D Gaussian representations.

Q2. The paper lacks detailed comparisons with BEV methods in terms of computational efficiency, especially in real-time driving scenarios.

Thank you for highlighting this important aspect. We fully agree that computational efficiency is critical for real-time deployment.

We compare the inference latency of a representative BEV-based method and our approach on an NVIDIA RTX 3090 GPU, as reported in Table 4 of the paper. Our method achieves superior perception and planning performance while maintaining a lower inference latency of approximately 32 ms.

Given the unavailability of an embedded platform such as the NVIDIA Jetson AGX Orin, we conduct latency measurements on a desktop GPU (GTX 1080 Ti) as a proxy. This choice is justified by the fact that both hardware platforms exhibit similar computational capabilities for inference. Using the PyTorch implementation with float32 precision and without inference-specific optimizations, our model achieved an average inference time of 65 ms, which demonstrates the real-time applicability of our approach.

We acknowledge the importance of direct on-device benchmarking and plan to include such evaluations when the necessary hardware becomes available.
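As an illustration of a typical latency protocol (not the authors' exact script), the sketch below times a stand-in network with warm-up iterations and explicit CUDA synchronization; the ResNet-34 stand-in and input resolution are assumptions.

```python
# Hedged sketch: average GPU latency with warm-up and CUDA synchronization.
# torchvision's ResNet-34 is only a stand-in for the real end-to-end model.
import time
import torch
import torchvision

model = torchvision.models.resnet34().cuda().eval()
x = torch.randn(1, 3, 256, 1024, device="cuda")  # assumed input resolution

with torch.no_grad():
    for _ in range(20):          # warm-up so one-time setup costs are excluded
        model(x)
    torch.cuda.synchronize()

    n = 100
    start = time.perf_counter()
    for _ in range(n):
        model(x)
    torch.cuda.synchronize()     # wait for all queued kernels before stopping the clock
    print(f"mean latency: {(time.perf_counter() - start) / n * 1e3:.1f} ms")
```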

Q3. Could you include ablation results for the dual-branch structure and cascade planning head to quantify their contributions to EPDMS/PDMS metrics?

Thanks for the valuable suggestion. We conducted ablation studies to quantify the individual contributions of the dual-branch structure and the cascade planning head to the EPDMS metrics. The results are summarized in the table below:

| Gaussian Exp. Fusion | Gaussian Imp. Fusion | Cascade Planning | Agent Pred. | Param. (M) | DAC↑ | TTC↑ | EP↑ | LK↑ | EPDMS↑ |
|---|---|---|---|---|---|---|---|---|---|
| | | | | 55.7 | 94.1 | 96.9 | 87.5 | 96.2 | 81.8 |
| | | | | 49.6 | 96.5 | 97.1 | 87.8 | 97.0 | 84.2 |
| | | | | 51.6 | 96.6 | 97.3 | 87.8 | 97.3 | 84.5 |
| | | | | 53.7 | 97.0 | 97.2 | 87.4 | 97.2 | 84.4 |
| | | | | 55.8 | 97.3 | 97.4 | 87.5 | 97.4 | 85.0 |
| | | | | 59.5 | 96.2 | 97.3 | 87.6 | 97.0 | 83.8 |

The results show that adding the Gaussian Explicit Fusion branch alone (second row) significantly improves perception performance, with DAC increasing from 94.1 to 96.5 and EPDMS from 81.8 to 84.2 compared to the baseline without either branch. Adding the Gaussian Implicit Fusion branch independently (third row) further enhances these metrics slightly, raising DAC to 96.6 and EPDMS to 84.5.

Separately, incorporating the cascade planning head with only the explicit fusion (fourth row) also improves trajectory prediction, increasing LK and EPDMS compared to using explicit fusion alone. The best overall performance is achieved when combining dual-branch fusion and the cascade head (fifth row), reaching an EPDMS of 85.0. We will add these ablation results to Table 3 of the paper.

Q4. Do NAVSIM and Bench2Drive cover extreme weather conditions? How robust is the framework under low visibility?

Thank you for this insightful question. The Bench2Drive benchmark includes scenarios with extreme weather conditions such as heavy rain and dense fog, offering a realistic testbed to evaluate robustness under challenging environments.

Our method demonstrates strong robustness in low-visibility conditions. As shown in the supplementary materials, a demo video illustrates our framework performing effectively in a scenario with dense fog and heavy rain. These results confirm that our approach can maintain reliable perception and planning performance despite adverse weather, underscoring its potential for real-world deployment in extreme conditions.

Comment

Dear Reviewer,

There is an author-reviewer discussion stage before Aug. 6th. While you are positive about this paper, please also take a look at the authors' rebuttal as soon as possible to see if you need further clarity. Please do not forget to leave a message to the authors in case you still require additional clarity, or if you want to post points of disagreement for discussion with the authors.

AC

Review
Rating: 5

The paper presents GaussianFusion, a multi-sensor fusion framework for end-to-end autonomous driving. It uses learnable 2D Gaussian primitives to represent the driving scene, enabling efficient, interpretable fusion of camera and LiDAR inputs. The architecture separates scene understanding and planning via a dual-branch design and refines trajectories using a cascade planning head. Results on NAVSIM and Bench2Drive show improved planning performance over existing methods.

Strengths and Weaknesses

Strengths

  • STRUCTURE and LANGUAGE: The paper follows a standard structure, which makes it easier to follow and read. Overall, the paper is well-written. The level of English is great. Figures are of good quality and improve the clarity. Formally, everything is on a high level.

  • STRONG REPRODUCIBILITY: The reproducibility is way above the standard. The code is provided, and implementation details (e.g., backbone, fusion modules, training settings, and hyperparameters) are clear.

  • METHOD DESIGN: The proposed methodology/architecture is well-motivated. Separating explicit spatial grounding (for map reconstruction) and implicit latent reasoning (for planning) is a clean and understandable design. The use of 2D Gaussians seems like a good balance between efficiency and interpretability. The cascade planning head is well-integrated and shows clear performance gains.

  • EMPIRICAL VALIDATION: Experiments are thorough and wide. Authors evaluate both open-loop (NAVSIM) and closed-loop (Bench2Drive) performance. Ablations support each component’s contribution. The method achieves state-of-the-art or very competitive performance in several key planning metrics with fewer parameters.

  • QUALITATIVE ANALYSIS: Visualizations of the Gaussian refinement process and trajectory outputs offer intuitive insight into the model's "performance" and potential interpretability. The supplemented video is a cherry on top.

Weaknesses

  • INCREMENTAL NOVELTY: Although the framework is well thought out / developed, the core ideas (Gaussian scene representations, dual-branch pipelines, cascade planning) build upon existing methods (e.g., GaussianFormer, TransFuser, GaussianAD). The novelty lies in how these components are combined for E2E planning, not in any fundamentally new technique.

  • LIMITED STATISTICAL RIGOR: Like many NeurIPS submissions, the evaluation is based on single-seed results without any variance estimates, error bars, or confidence intervals (acknowledged by the authors). This is particularly important, given that some of the reported improvements over prior work (e.g., DiffusionDrive or ARTEMIS) are relatively small and could plausibly be attributed to randomness. Without repeated runs or statistical testing, it is difficult to assess whether the performance gains are robust or cherry-picked.

  • SPARSE REPRESENTATION TRADE-OFFS: While the use of sparse Gaussians improves computational efficiency, it is unclear how well this representation captures fine spatial details. Particularly in edge cases involving small or fast-moving objects, which is a major concern in end-to-end autonomous driving, where robustness to OOD objects (as emphasized in the COOOL benchmark [1, 2, 3]) is essential. Authors do not evaluate or discuss how well GaussianFusion handles such scenarios, nor compare it against denser representations.

[Additional Comments]:

  • Table and figure captions lack clarity: Table and figure captions should be self-explanatory, i.e., understandable without needing to refer to the text. For example, it is not clear what the colors and/or the numbers in Figure 3 and Figure 5 represent.
  • About interpretability: The claim that GaussianFusion is more interpretable is somewhat intuitive, but no formal interpretability analysis (e.g., user study, attribution) is provided; such an analysis could be beneficial for the contribution framing.
  • Inconsistent bolding in figure captions: The authors should standardize the use of bold text in figure/table captions. The inconsistency is not visually appealing.
  • Incorrect citation formatting: The in-text referencing style does not follow NeurIPS guidelines. References should be (i) ordered numerically, i.e., not [8, 3, 5] but [3, 5, 8] and (ii) separated by commas (e.g., “[3, 5, 8]”), not semicolons.

[1] AlShami A K, Kalita A, Rabinowitz R, et al. "COOOL: Challenge of Out-of-Label: A Novel Benchmark for Autonomous Driving". arXiv preprint arXiv:2412.05462, 2024.

[2] Picek L, Cermak V, Hanzl M. "Zero-Shot Hazard Identification in Autonomous Driving: A Case Study on the COOOL Benchmark". WACV, 2025.

[3] Shriram S, Perisetla S, Keskar A, et al. "Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety". arXiv preprint arXiv:2504.13399, 2025.

Questions

  • Q1: How does or would the GaussianFusion perform under real-world challenges such as fog, occlusions, or sensor noise? Any plans to test beyond simulation?
  • Q2: Do sparse Gaussians compromise resolution or miss small/fast-moving objects compared to BEV grids?
  • Q3: Does the use of sparse 2D Gaussians lead to reduced spatial resolution or difficulty in capturing small (e.g., dog on a leash, little kids) or fast-moving objects compared to dense BEV-based representations?
  • Q4: Do the implicit and explicit branches ever conflict or interfere? How is their interaction managed during training?

Limitations

Partially yes. There is no dedicated Limitations section, just one sentence in the Conclusion where the authors mention the use of custom CUDA operations and acknowledge that performance may not be optimal. It would be great if the authors included more points about, for example, (i) robustness to sensor failure, weather, or domain shift, (ii) generalization of 2D Gaussians (vs. 3D or hybrid methods), and (iii) failure cases.

Final Justification

The authors have addressed my main concerns with a clear response and new results, including multi-seed experiments and further clarification of the sparse Gaussian representation. Their answers improve the paper’s clarity, especially regarding statistical reporting and empirical robustness. While some suggestions, such as real-world evaluation and formal interpretability analysis, remain as future work, these do not significantly affect my assessment of the technical quality or contribution. Reading other reviewers’ comments did not change my view. I will maintain my overall positive rating.

Formatting Issues

The paper contains a few minor formatting issues related to referencing. The overall style differs from other papers and the template; e.g., the authors use semicolons instead of commas, and references are not ordered.

Author Response

Q1. Although the framework is well thought out / developed, the core ideas build upon existing methods.

We sincerely appreciate the reviewer's recognition of our overall framework design. Our approach is inspired by prior great works, including 3D Gaussian Splatting [1] and GaussianFormer [2]. Our key contribution lies in introducing 2D Gaussian scene representations as a unified bridge for multi-sensor feature fusion in end-to-end planning. Notably, we are the first (to our knowledge) to apply 2D Gaussians for driving scene representation—a design that simultaneously achieves three critical advantages:

  1. Flexible: Positions and shapes of 2D Gaussians can be dynamically adjusted during the feed-forward process.
  2. Efficient: A few hundred 2D Gaussians are sufficient to model an entire driving scene.
  3. Intuitive: 2D Gaussians can be directly visualized during the refinement process.

We believe this design provides a different and practical perspective for scene representation in autonomous driving.
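To make the representation concrete, the sketch below shows one plausible tensor layout for a set of 2D Gaussian primitives on the ground plane; the field names and the update comment are illustrative assumptions, while the 512 Gaussians and 128-dimensional features are the sizes mentioned in this rebuttal.

```python
# Hedged sketch: a possible tensor layout for 2D Gaussian scene primitives.
# Field names and shapes are illustrative, not the paper's actual implementation.
import torch

N, C = 512, 128  # number of Gaussians and feature width (sizes quoted in the rebuttal)

gaussians = {
    "mean":     torch.zeros(N, 2),   # (x, y) position in the ego BEV frame
    "scale":    torch.ones(N, 2),    # per-axis extent of each Gaussian
    "rotation": torch.zeros(N, 1),   # yaw of the principal axis
    "explicit": torch.zeros(N, C),   # features supervised by the semantic map branch
    "implicit": torch.zeros(N, C),   # features consumed by the planning branch
}

# A refinement iteration would predict residuals from fused camera/LiDAR features
# and update geometry and features in place, e.g.:
#   gaussians["mean"] += delta_mean      # delta_mean: hypothetical (N, 2) residual
#   gaussians["explicit"] += delta_feat  # delta_feat: hypothetical (N, C) residual
```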

Q2. The evaluation is based on single-seed results without any variance estimates, error bars, or confidence intervals.

Thank you for your valuable comment. Following your suggestion, we conducted multi-seed experiments (three times) on both the NAVSIM and Bench2Drive benchmarks. As shown in the tables below, we report the mean and standard deviation.

Table 1. Performance on the Navtest Benchmark.

| Method | Input | NC ↑ | DAC ↑ | TTC ↑ | EP ↑ | PDMS ↑ |
|---|---|---|---|---|---|---|
| LTF | Camera | 97.4 | 92.8 | 92.4 | 79.0 | 83.8 |
| TransFuser | C & L | 97.7 | 92.8 | 93.0 | 79.2 | 84.0 |
| GoalFlow | C & L | 98.3 | 93.3 | 94.8 | 79.8 | 85.7 |
| Hydra-MDP++ | C & L | 98.3 | 96.0 | 94.6 | 78.7 | 86.5 |
| ARTEMIS | C & L | 98.3 | 95.1 | 94.3 | 81.4 | 87.0 |
| DiffusionDrive | C & L | 98.2 | 96.2 | 94.7 | 82.2 | 88.1 |
| GaussianFusion (Ours) | C & L | 98.4±0.1 | 97.3±0.1 | 94.6±0.2 | 83.1±0.1 | 88.9±0.1 |

Table 2. Performance on the Bench2Drive benchmark. '*' denotes the results are reproduced by ourselves.

| Method | DS | SR | Merge | Overtake | EBrake | GiveWay | TSign | Mean |
|---|---|---|---|---|---|---|---|---|
| PDM-Lite | 97.0 | 92.3 | 88.8 | 93.3 | 98.3 | 90.0 | 93.7 | 92.8 |
| AD-MLP | 18.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.4 | 0.9 |
| TCP | 40.7 | 15.0 | 16.2 | 20.0 | 20.0 | 10.0 | 7.0 | 14.6 |
| VAD | 42.4 | 15.0 | 8.1 | 24.4 | 18.6 | 20.0 | 19.2 | 18.1 |
| UniAD | 45.8 | 16.4 | 14.1 | 17.8 | 21.7 | 10.0 | 14.2 | 15.6 |
| ThinkTwice | 62.4 | 33.2 | 27.4 | 18.4 | 35.8 | 50.0 | 54.4 | 37.2 |
| DriveTransformer | 63.5 | 35.0 | 17.6 | 35.0 | 48.4 | 40.0 | 52.1 | 38.6 |
| DriveAdapter | 64.2 | 33.1 | 28.8 | 26.4 | 48.8 | 50.0 | 56.4 | 42.1 |
| TF++* | 76.9±0.9 | 54.0±1.0 | 48.8±2.2 | 37.6±7.0 | 64.2±8.4 | 50.0±0.0 | 59.7±6.1 | 52.0±3.0 |
| GaussianFusion (Ours) | 79.1±1.1 | 54.4±2.6 | 36.6±3.3 | 64.4±2.3 | 66.5±6.4 | 53.3±5.8 | 60.8±6.4 | 56.3±3.7 |

On the NAVSIM dataset, our method demonstrates low variance, indicating strong robustness to random initialization. On the closed-loop Bench2Drive benchmark, although the variance is relatively higher, our method still consistently outperforms baseline methods on average, highlighting its stability and effectiveness across diverse conditions.

We will include these aggregated results and variance estimates in the final version of the paper.

Q3. How does the sparse Gaussian representation perform on small and fast-moving objects?

The proposed 2D Gaussian sparse representation is capable of effectively capturing objects of various sizes, including small and dynamic ones, due to its deformable and adaptive nature. Unlike BEV-based dense representations, which may suffer from resolution constraints or fixed grid structures, our 2D Gaussians allow for flexible allocation of representation capacity to fine-grained details.

As shown in Table 4, our method achieves higher semantic map quality compared to BEV-based baselines. Furthermore, as demonstrated in the supplementary demo video, the 2D Gaussians successfully capture moving objects such as pedestrians and vehicles, indicating their effectiveness in dynamic driving scenes.

Q4. How does or would the GaussianFusion perform under real-world challenges such as fog, occlusions, or sensor noise? Any plans to test beyond simulation?

We appreciate the reviewer’s thoughtful question. The robustness of GaussianFusion under adverse weather conditions has been partially validated using the CARLA-based Bench2Drive benchmark, which includes diverse environmental settings such as fog, rain, and cloudy skies. As shown in the supplementary demo video, our method performs robustly in a foggy and rainy scenario, indicating strong adaptability to challenging visibility conditions.

However, we have not explicitly evaluated the method under severe occlusions or sensor noise. We agree that assessing performance under these conditions is crucial for real-world deployment. We sincerely thank the reviewer for this insightful suggestion and will explore these factors in future work.

Regarding real-world testing, due to the high cost and complexity of such experiments, we currently do not have plans to move beyond simulation. Nevertheless, we believe that the insights gained from simulation benchmarks like Bench2Drive provide a valuable proxy for many real-world challenges.

Q5. Do the implicit and explicit branches ever conflict or interfere? How is their interaction managed during training?

We did not observe any conflicts or interference between the implicit and explicit branches in our experiments. On the contrary, as shown in Table 3, incorporating the implicit branch consistently improves performance, indicating that the two branches provide complementary information rather than competing signals.

In our current implementation, their interaction is managed by simply concatenating their outputs in the trajectory prediction module. While this approach proves effective in our setting, it is not necessarily optimal. We plan to explore more effective interaction mechanisms in future work.

Q6. Table and figure captions lack clarity and exhibit inconsistent bolding.

Thank you for your helpful comment. We will carefully revise all table and figure captions to improve clarity, ensure consistency in formatting, and enhance overall readability in the final version of the paper.

Q7. The claim that GaussianFusion is more interpretable is kind of intuitive, but no formal interpretability analysis (e.g., user study, attribution) is provided, which could be beneficial for the contribution framing.

We agree with the reviewer that our paper lacks a formal interpretability analysis. In this work, our primary focus is on validating the effectiveness of the GaussianFusion framework in the planning task.

We argue that the proposed 2D Gaussian representation is inherently more intuitive than flattened feature vectors or dense BEV grids. It allows for visual inspection of spatial changes during the feed-forward process, which provides a more direct understanding of what the model attends to in a scene.

We acknowledge that incorporating formal interpretability evaluations (e.g., attribution maps or user studies) could strengthen our contribution, and we plan to explore this direction in future work.

Q8. Incorrect citation formatting: References should be ordered numerically and separated by commas.

Thank you for your kind suggestion. We will carefully revise all in-text citations to ensure they are numerically ordered and properly formatted, following the required style in the final version of the paper.

References

[1] Kerbl B, Kopanas G, Leimkühler T, et al. "3D Gaussian splatting for real-time radiance field rendering". ACM Transactions on Graphics, 2023.

[2] Huang Y, Zheng W, Zhang Y, et al. "Gaussianformer: Scene as gaussians for vision-based 3d semantic occupancy prediction". ECCV, 2024.

Comment

Thank you for your rebuttal.

The inclusion of multi-seed results addresses my concerns about statistical robustness, and your clarifications on method design and branch interaction are clear. I hope my comments and review will help you to strengthen your revised manuscript.

The following comments are not major, but should be addressed by the authors.

  • On the topic of interpretability, I find your response somewhat contradictory. You argue that 2D Gaussians are “inherently more intuitive,” but also acknowledge there is no formal interpretability analysis in the paper. If interpretability is an integral part of your contribution, it needs to be demonstrated / measured, not just claimed. I strongly recommend either toning down this claim in the manuscript or including at least some basic formal evaluation in the camera-ready version.

  • Similarly, the limitations regarding real-world robustness and the handling of small or fast-moving objects should be explicitly discussed (at least shortly) in the revised manuscript, not deferred to future work.

Comment

Thank you for your helpful and professional comments. We would like to clarify that interpretability is not a contribution of our method. The 2D Gaussians serve as a visualization tool to illustrate the dynamics of scene representations during the feed-forward process. In response to your suggestion, we will revise the manuscript by replacing terms such as “intuitive” and “interpretability” with “transparency”, which we believe more accurately and objectively reflects our intended meaning.

We acknowledge that we have not conducted quantitative evaluations on tracking small or fast-moving objects. Similarly, our method has not been assessed under sensor occlusion or noise, which are critical factors for real-world deployment. Moreover, our current design depends on 3D bounding box and road topology annotations to optimize the 2D Gaussians. These limitations will be clearly discussed in a dedicated Limitations section in the revised manuscript.

We sincerely appreciate the reviewer’s input in helping us better define the scope and applicability of our work.

Final Decision

This paper proposed a multi-sensor fusion framework for end-to-end autonomous driving based on a 2D Gaussian representation. Experiments were conducted on NAVSIM and Bench2Drive to validate its planning performance for autonomous driving. After the rebuttal and author-reviewer discussion phase, all the reviewers reached a consensus to accept the paper, based on its contribution of proposing a new fusion framework beyond existing BEV- or attention-based ones. After reading the paper, reviews, authors' feedback, and discussions, the AC agrees with the reviewers and recommends accepting this paper.

Since the proposed fusion framework could provide new insights for multi-sensor fusion, not restricted to autonomous driving, the AC would like to recommend a spotlight.