PaperHub
Overall rating: 6.0 / 10
Poster · 4 reviewers
Individual ratings: 6, 5, 6, 7 (min 5, max 7, std 0.7)
Confidence: 4.5
Correctness: 3.0
Contribution: 2.8
Presentation: 3.0
NeurIPS 2024

VeXKD: The Versatile Integration of Cross-Modal Fusion and Knowledge Distillation for 3D Perception

OpenReview | PDF
Submitted: 2024-05-14 | Updated: 2024-11-06
TL;DR

A simple yet versatile framework that jointly considers multi-modal fusion and cross-modal knowledge distillation.

Abstract

Keywords
3D Perception, Multi-modal Fusion, Cross-modal Knowledge Distillation

Reviews and Discussion

Review (Rating: 6)

The paper introduces a modality-general fusion teacher that narrows the gap between teacher and single-modal student models. This framework also includes a data-driven mask generation network that creates unique spatial masks for different feature levels and tasks. These masks enhance feature distillation by selectively transferring valuable information from the teacher’s feature maps.

Strengths

  1. The paper proposes the integration of cross-modal fusion and Knowledge Distillation in 3D perception, enhancing the efficacy of cross-modal KD through a modality-general fusion teacher.
  2. The paper develops a task- and modality-agnostic KD approach, making it highly versatile for any BEV-based 3D perception task and adaptable to various student modalities.

Weaknesses

  1. The baselines are too weak. More baselines should be included to validate the effectiveness, such as BEVDepth [1] and long-term temporal fusion settings [2].
  2. The related work section is incomplete and does not include some state-of-the-art or similar prior works [3,4]. Additionally, the state-of-the-art work VCD [3], which focuses on L+C->C, is not compared in Table 1.
  3. The novelty is limited: masked feature distillation was already proposed in FD3D [5].

[1] BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection

[2] Time Will Tell: New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection

[3] Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

[4] BEVSimDet: Simulated Multi-modal Distillation in Bird's-Eye View for Multi-view 3D Object Detection

[5] Distilling Focal Knowledge From Imperfect Expert for 3D Object Detection

Questions

  1. The related work section should be more comprehensive, and the experiments should compare with state-of-the-art methods.
  2. More experiments with different baselines mentioned above should be conducted to validate the effectiveness of the methods.

Limitations

The paper has discussed its limitations.

Author Response

1. Q1: Experiment on More Baselines

Thank you for your suggestion. We have conducted 3D object detection KD experiments on both BEVDepth and the temporally fused BEVFormer as student models. As shown in Table 1 attached to the global response, our KD framework was applicable to both models and improved their performance.

2. Q2: Modified Related Work

Thank you for your constructive suggestion; we have modified our related work accordingly. Due to space limitations, and to avoid overloading the official comments, we summarize the modifications instead of posting the entire related work section.

  • Added discussion of different ways of utilizing feature distillation

DistillBEV[5] decomposes the region of feature maps and enhances attention to false positive regions. Recently, inspired by pretext tasks in large language models, masked generative distillation has been proposed [1]. Unlike attentive distillation, generative distillation masks part of the student’s feature map using a random mask. This masked map is then processed with a generator network to reconstruct feature maps that closely approximate the teacher’s, thus enhancing knowledge transfer. However, random masks can destabilize algorithms, especially in 3D object detection with pronounced foreground-background imbalance. Zeng et al. [2] address this by using a learned distillation head to predict coarse foreground boxes where random mask distillation is applied, focusing more on foreground regions to enhance stability. Our method, while inspired by masked generative distillation, aligns more closely with attentive distillation and eliminates the need for additional generator networks.

  • The efficacy of KD

To boost KD efficacy by minimizing the modality gap, Huang et al. [3] developed a "vision-centric" multi-modal teacher, reducing reliance on LiDAR model operations to align more closely with camera-based students. Conversely, SimDistill [7] adds a branch to the student model to simulate multi-modal processing, narrowing the teacher-student gap but increasing the student model's size and inference time, deviating from traditional KD goals. Our work focuses on developing a modality-general fusion model without altering the existing pipelines of either teacher or student networks.

  • More cross-modal research on camera-based students

Recent works have advanced short-term [8, 9] and long-term [4] temporal fusion in multi-camera 3D perception. Zheng et al. [6] use a long-term temporal fusion teacher to impart temporal cues to a short-term memory camera student, while [3] warps time-series ground truth into the current timestamp for long-term temporal supervision.

3. W3: Novelty of Masked Feature Distillation

Thank you for your thought-provoking comment. Indeed, our work shares some similarities with FD3D, as both are inspired by the work Masked Generative Distillation [1].

Both [1] and FD3D [2] are generative distillation methods in nature, where certain feature positions are masked and reconstructed from other feature locations to achieve KD. FD3D further generates bounding boxes through a learned distillation head, ensuring that only features within the bounding boxes are involved. In contrast, our method utilizes masks to selectively choose features to align pixel-wise, akin to attentive distillation. This requires only an auxiliary network for generating spatial masks, thereby eliminating the additional complexity of reconstruction.

Our approach also differs in mask generation: FD3D is instance-based, using BEV queries to guide distillation within coarse bounding boxes. We have adapted our method to address the semantic complexities of the BEV space, initializing and learning dense BEV queries for fine-grained, pixel-wise mask generation that interacts directly with both student and teacher feature maps, independent of bounding boxes. This allows for greater generalization across various downstream tasks besides object detection.
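To make the distinction concrete, below is a minimal, hypothetical PyTorch sketch of mask-guided attentive distillation in the spirit described above. The module structure, names, and shapes are illustrative assumptions rather than the exact implementation in our paper or in FD3D; the key point is that the learned mask only weights a pixel-wise alignment term, so no reconstruction network is needed.

```python
import torch
import torch.nn as nn

class BEVQueryMaskGenerator(nn.Module):
    """Illustrative sketch: dense BEV queries interact with both student and
    teacher BEV feature maps to produce a pixel-wise spatial mask."""

    def __init__(self, channels, bev_h, bev_w):
        super().__init__()
        # One learnable query per BEV cell (bev_h x bev_w must match the feature map size).
        self.queries = nn.Parameter(torch.randn(channels, bev_h, bev_w))
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.to_mask = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, f_student, f_teacher):
        # f_student, f_teacher: (B, C, H, W) spatially aligned BEV features.
        fused = self.fuse(torch.cat([f_student, f_teacher], dim=1))
        fused = fused * self.queries.unsqueeze(0)   # modulate with dense BEV queries
        return torch.sigmoid(self.to_mask(fused))   # (B, 1, H, W) mask in [0, 1]

def masked_attentive_kd_loss(f_student, f_teacher, mask):
    """Mask-weighted pixel-wise alignment; no generator or reconstruction step."""
    per_pixel = (f_student - f_teacher.detach()).pow(2).mean(dim=1, keepdim=True)
    return (mask * per_pixel).sum() / mask.sum().clamp(min=1e-6)
```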

While both our method and FD3D utilize learned masks to identify useful feature locations and filter out noisy ones, FD3D focuses primarily on model compression in single-modality scenarios. Our study, however, considers cross-modal KD scenarios, including how to incorporate more modality-general information within the teacher and how to perform attentive distillation of this fine-grained modality-general information through a collaborative mask-learning process involving both student and teacher.

In summary, compared to the mentioned work, the novelties of our BEV guided masked learning and distillation include:

  1. Focusing on attentive over generative distillation.

  2. Our method generates masks based on interactions between the learned dense BEV queries and the feature maps of both student and teacher, enhancing fine-grained feature extraction and adaptability to various tasks.

  3. Our method explores cross-modal KD scenarios, aiming to enhance the teacher model with broader modality-general information and, in turn, utilizing the information contained in the learned BEV queries to perform mask-guided, selective attentive distillation.

References:

[1] Masked Generative Distillation

[2] Distilling Focal Knowledge From Imperfect Expert for 3D Object Detection

[3] Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

[4] Time Will Tell: New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection

[5] DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation

[6] Distilling Temporal Knowledge with Masked Feature Reconstruction for 3D Object Detection

[7] SimDistill: Simulated Multi-Modal Distillation for BEV 3D Object Detection

[8] BEVDet4D: Exploit Temporal Cues in Multi-Camera 3D Object Detection

[9] BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Comment

The authors have addressed most of my concerns. However, some issues remain. Notably, as mentioned in the weaknesses section, the state-of-the-art work VCD [3], which focuses on L+C->C, is not compared in Table 1.

Comment

I apologize for the delayed response; we encountered some issues with the data pipeline and version incompatibilities while modifying and running the code, but these have now been resolved. :)

We continued to use our multi-modal model as the KD teacher for the experiment on the student model "bevdet4d-r50-longterm-depth," a multi-camera student that uses 8 frames.

We have carefully reviewed the VCD paper and code, both of which specify that the student model is also "bevdet4d-r50-longterm-depth." However, the baseline that VCD reproduced was slightly better than the results in the BEVDet4D GitHub repository. Due to time and resource constraints, we were unable to reproduce the training of the "bevdet4d-r50-longterm-depth" baseline ourselves and instead used the results from the BEVDet4D code repository directly as our baseline. The comparison table is as follows.

| Method | mAP | NDS | mAVE | mAAE | mASE | mAOE | mATE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bevdet4d | 39.4 | 51.5 | 28.2 | 20.6 | 28.1 | 46.9 | 57.9 |
| + VCD | 42.6 | 54.0 | 26.8 | 20.7 | 27.1 | 43.3 | 54.7 |
| + VeXKD (Ours) | 42.8 | 53.5 | 29.6 | 21.4 | 27.5 | 45.2 | 55.2 |

As can be seen from the table, our method can be applied to a camera student with long-term temporal fusion and improves both mAP and NDS on bevdet4d. The mAP gain is slightly larger than VCD's, indicating that using multi-sweep LiDAR in the teacher can enhance the student's localization capability. However, the NDS gain is somewhat smaller than VCD's, largely because mAVE deteriorated. We believe this is because bevdet4d's specialized velocity optimizations already exceed what the multi-modal fusion teacher can contribute, and inaccurate velocity estimates for newly introduced positive predictions may also contribute to the decline in mAVE.

Overall, our method is applicable to and yields positive effects on a student with long-term temporal fusion, and we also see the value of designing explicit temporal-fusion KD methods. We believe the main focus of our method should be a simple and versatile KD framework for students of all modalities. As mentioned in the global response, once multi-modal explicit long-term temporal fusion methods are mature and open-sourced, and a mainstream pipeline for long-term temporal fusion operations is established, explicit temporal KD can serve as a pluggable module in a versatile KD framework.

Comment

Thanks for your time and effort in addressing my concerns. I suggest the authors complete the experiments and include the VCD results in Table 1 to make the paper more comprehensive. Based on these improvements, I am inclined to change my rating to Weak Accept.

Comment

Thanks again for your constructive suggestions. I will complete these parts. :)

Review (Rating: 5)

This paper introduces VeXKD, a Versatile framework that integrates Cross-Modal Fusion with Knowledge Distillation for 3D detection tasks. The framework adopts a modality-general cross-modal fusion module to bridge the modality gap between the multi-modal teachers and single-modal students. Extensive experiments on the nuScenes dataset demonstrate improvements for 3D detection and BEV map segmentation tasks.

Strengths

Cross-modal fusion and knowledge distillation for 3D perception is a significant problem, and this paper proposes an alternative strategy to current methods. The motivation for the paper is reasonable and inspiring. The proposed versatile knowledge distillation method for different downstream 3D perception tasks, such as 3D detection and map segmentation, is innovative. This paper is well-written and easy to follow.

Weaknesses

  1. As shown in Table 1, CenterPoint+VeXKD (L+C->L) achieves lower detection accuracy than TransFusion-L (L). Then does it make sense to use a multi-modality teacher model to guide a lidar-based student model?
  2. The inference speed of the student model needs to be given and compared with existing state-of-the-art models and real-time inference models on detection and map segmentation tasks.

Questions

  1. As shown in Table 1, CenterPoint+VeXKD (L+C->L) achieves lower detection accuracy than TransFusion-L (L). Then does it make sense to use a multi-modality teacher model to guide a lidar-based student model?
  2. The inference speed of the student model needs to be given and compared with existing state-of-the-art models and real-time inference models on detection and map segmentation tasks.
  3. What effect does temporal information have on the final results? (No experimental results are needed, just an exploratory question.)

Limitations

None

Author Response

Q1: CenterPoint+VeXKD (L+C->L) Compared to TransFusion-L (L)

Thank you for your question. To ensure fair comparisons with existing KD methods like S2M2-SSD and UniDistill, we chose CenterPoint as the LiDAR student model. CenterPoint inherently underperforms TransFusion-L by 5.2 mAP and 7.1 NDS, largely because TransFusion-L utilizes a more advanced DETR head. However, as shown in the modified Table 1 attached to our global response, TransFusion-L runs at nearly half the FPS of CenterPoint due to its time-consuming attention operations. By applying cross-modal KD to CenterPoint while preserving its original fast inference speed, we brought its performance closer to that of TransFusion-L, thus achieving a more favorable balance between precision and real-time performance. Therefore, cross-modal KD does make sense for LiDAR-based student models.

Q2: The inference speed comparison

Thank you for your constructive feedback. As described in the global response, we have added a comparison of the inference floating-point operations (FLOPs) and required inference time for the different models to Table 1. This adjustment allows for a clearer view of the trade-off between accuracy and real-time performance offered by knowledge distillation. Once again, we appreciate your suggestions. :)

Q3: Incorporation of Temporal Information

Thank you for your constructive questions. As clarified in the global response, our teacher model and LiDAR student models inherently incorporate multi-sweep LiDAR inputs, thus implicitly integrating temporal information. Inspired by your feedback, we conducted additional experiments with BEVFormer, a camera-based student model that explicitly incorporates temporal fusion, and observed performance gains, as detailed in the global response. As technologies for multi-modal explicit temporal fusion continue to mature and become open-sourced, we foresee the potential for integrating explicit temporal KD operations as a pluggable module into the VeXKD framework in future developments.

Comment

Thanks for your response. The rebuttal has addressed most of my concerns. I would like to keep my original rating.

Comment

We are very glad to hear from you and have your concerns addressed. Thanks again for your time and suggestions.

Review (Rating: 6)

This paper presents VeXKD, an innovative framework that combines Cross-Modal Fusion and Knowledge Distillation (KD) to significantly enhance 3D perception capabilities. VeXKD employs knowledge distillation on BEV feature maps, facilitating the seamless transfer of multi-modal insights to single-modal student models without incurring additional computational overhead. The framework incorporates a versatile cross-modal fusion module designed to bridge the performance gap between multi-modal teacher models and their single-modal counterparts. Extensive experiments conducted on the nuScenes dataset have yielded substantial performance improvements, effectively reducing the disparity with state-of-the-art multi-modal models.

Strengths

  1. The integration of Cross-Modal Fusion with Knowledge Distillation in the domain of 3D perception is a novel approach that offers a fresh perspective for enhancing single-modal models.

  2. The experimental outcomes on the nuScenes dataset, which show significant improvements in key metrics such as mAP, NDS, and mIoU, substantiate the effectiveness of the proposed methodology. Extensive ablation studies have also demonstrated the effectiveness of each module.

  3. The framework is designed to be modality- and task-agnostic, capable of being applied to various student modalities and downstream tasks without being constrained by specific network architectures or processing steps, ensuring the universality of the approach.

Weaknesses

  1. The framework's heavy reliance on BEV feature maps might limit its applicability to other types of feature representations.

  2. The paper's strategy for selecting teacher models is rather limited, lacking a comprehensive comparative analysis of the method's performance and effectiveness across various teacher model configurations.

  3. The authors have not evaluated whether the proposed distillation method remains effective when incorporating temporal information.

  4. Some of the tables lack sufficient detail, which affects readability. For instance, adding the types of teacher models in the caption of Table 1 would provide clearer context and enhance the table's usefulness.

Questions

See weakness.

Limitations

The authors have discussed the limitations of their work in the paper.

Author Response

1. Q1: Adaptation on other feature representation

In our manuscript, the experiments were conducted on BEV feature maps. The BEV feature space has become a focal point of research in recent years due to its favorable compatibility with multiple modalities and the similarity of its processing pipeline across modalities.

However, as long as the student and teacher models can achieve spatial alignment, the specific feature space in which KD is conducted would not impact its effectiveness that much empirically. Our approach of masked distillation is inspired by previous KD work conducted in the RGB image feature space [1] and has been adapted to address the BEV space's semantic complexity by initializing and learning dense BEV queries. Similarly, the methodology we have developed for masked feature distillation and the construction of a modality-general fusion model can be adapted to other feature spaces, such as RGB or depth image features.

2. Q2 & Q4: Adding a Teacher-Model Type Column for KD Methods & Comparing Different Teacher Models

Supplementary Table: Overview of Teacher Models Used in Knowledge Distillation Methods

| Method | Modality | Teacher Model |
| --- | --- | --- |
| CenterPoint | L | -- |
| + S2M2-SSD | L+C -> L | Multi-modality SSD |
| + UniDistill | L+C -> L | BEVFusion |
| + VeXKD (Ours) | L+C -> L | Modality-General Fusion Teacher (Ours) |
| BEVDet-R50 | C | -- |
| + UniDistill | L+C -> C | BEVFusion |
| + VeXKD (Ours) | L+C -> C | Modality-General Fusion Teacher (Ours) |
| BEVFormer-S | C | -- |
| + UniDistill | L+C -> C | BEVFusion |
| + BEVDistill | L -> C | Object DGCNN [3] |
| + VeXKD (Ours) | L+C -> C | Modality-General Fusion Teacher (Ours) |

Thank you for your constructive suggestions, which prompted us to add a table listing the teacher models used in various KD methods. As indicated in the table, most KD methodologies consider the architectural similarity between the teacher and student models when choosing the teacher model. For instance, S2M2-SSD employs a multi-modal teacher model similar to PointPainting [2], which fuses image segmentation results with LiDAR features before voxelization so that the LiDAR student is supervised right from the voxelization process. UniDistill adopts the BEVFusion model as a teacher, which aligns structurally with single-modal students. BEVDistill ensures structural similarity with the BEVFormer student by using Object DGCNN [3] as the LiDAR teacher, which is based on the DETR encoder-decoder architecture.

In our VeXKD study, we adopted a BEVFusion pipeline similar to Unidistill but modified the fusion module to enhance the teacher's efficacy in KD. This modification was inspired by observing performance gaps between Unidistill and those KD methods exclusive to camera students like BEVDistill. Our ablation study also reveals that a significant portion of the performance gain is attributable to these modifications in the teacher models.

Additionally, when implementing our research, we explored building a multi-modal teacher model using the global attention method described in [4]. However, replicating the fusion teacher with this method resulted in significant GPU memory consumption, challenging the training process. This experience inspired the replacement of global attention with more efficient deformable attention for modality-general fusion.

As mentioned in the limitations of our paper, the lack of research quantifying the modality-general information contained in teacher models makes the process of experimenting with different teacher models time-consuming, effort-intensive, and fraught with uncertainty. Our paper aims to provide an example schema for extracting modality-general information from teacher models without adapting the teacher's model architecture. We hope that future research will include more theoretical analyses on the different teachers in cross-modal KD.

3. Q3: Incorporation of temporal information

Thank you for your suggestions. As clarified in the global response, both our teacher model and LiDAR students inherently incorporate multi-sweep LiDAR inputs, thereby implicitly integrating temporal information.

Inspired by your feedback, we conducted additional experiments with BEVFormer, a student model that combines temporal information, and observed performance gains, as detailed in the global response. As technologies for multi-modal explicit temporal fusion continue to mature and become open-sourced, we foresee the potential for integrating explicit temporal KD operations as a pluggable module into the VeXKD framework in future developments.

References

[1] Huang, Tao, et al. "Masked distillation with receptive tokens." arXiv preprint arXiv:2205.14589 (2022).

[2] Vora, Sourabh, et al. "Pointpainting: Sequential fusion for 3d object detection." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

[3] Wang, Y., & Solomon, J. M. (2021). Object dgcnn: 3d object detection using dynamic graphs. Advances in Neural Information Processing Systems, 34, 20745-20758

[4] Man, Yunze, Liang-Yan Gui, and Yu-Xiong Wang. "BEV-guided multi-modality fusion for driving perception." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Comment

Thanks for your detailed response. Since most of my concerns are addressed, I would like to raise my score to weak accept :)

Comment

We are very pleased to have addressed your concerns, and we appreciate the constructive feedback you have provided. :)

Review (Rating: 7)

This paper proposes VeXKD, a method that performs Knowledge Distillation (KD) in the BEV feature space. By distilling cross-modal knowledge from a teacher model into a single-modal student model, VeXKD eliminates the need for additional inference time overhead. The distilled student model can be adapted to various tasks by attaching task-specific heads.

Strengths

  1. The paper employs a cross-modal KD approach to transfer insights from a multi-modal teacher model to a student model without adding extra overhead. Experimental results demonstrate that the student model’s performance significantly improves.
  2. The paper introduces a mask generation module, which ensures that only useful information is transferred during the KD process using these masks.
  3. The proposed KD method is highly flexible, capable of handling various modality inputs and downstream tasks.

Weaknesses

  1. The student and teacher structures must adhere to the design shown in Appendix Figure A.1, which is not compatible with models that do not use the BEV feature space.
  2. The masked teacher perception loss requires task-specific training, hindering the possibility of performing new tasks in a zero-shot manner.

Questions

  1. Building on Weakness 1, is there an easy way to apply this pipeline to a model that does not use the BEV feature space?
  2. If possible, could the authors provide results for other modalities? If time or computational resources do not permit, could the authors clarify what additional efforts are required to apply this method to other modalities?

Limitations

The authors have addressed the limitations of their work in the checklist, and there are no significant concerns regarding potential negative societal impacts.

Author Response

Q1: Adaptation on other feature representation

In response to your valuable insights, we would like to offer some clarifications. When conducting KD, it is crucial for the feature maps of both the student and the teacher to reside within the same feature space to ensure spatial and semantic compatibility. In this regard, the BEV feature space has gained significant attention in recent years due to its favorable compatibility with multiple modalities and the similar perception pipeline across modalities in 3D perception. Indeed, the volume of research focused on the BEV feature space far surpasses that on other feature spaces in recent years.

Furthermore, if student and teacher feature maps are in different feature spaces, projection can be used to align them. Once the spatial alignment is done, the specific feature space used for KD would not, theoretically or empirically, impact its effectiveness that much. For example, our masked distillation is inspired by previous work in the RGB image feature space [1] and is adapted to better address the semantic complexity of the BEV space through strategies like the initializing and learning of dense BEV queries. Similarly, the methodologies we have developed for masked feature distillation and the construction of a modality-general fusion model can be adapted to other feature spaces, such as RGB or depth image features.
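As an illustration of the projection step mentioned above, here is a minimal, hypothetical adapter (module name and shapes are assumptions, not taken from our codebase) that maps student features into the teacher's channel dimension and spatial resolution before any KD loss is applied:

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureAligner(nn.Module):
    """Hypothetical adapter: project student features into the teacher's
    channel dimension and spatial resolution before computing a KD loss."""

    def __init__(self, c_student, c_teacher):
        super().__init__()
        self.proj = nn.Conv2d(c_student, c_teacher, kernel_size=1)

    def forward(self, f_student, f_teacher):
        f_aligned = self.proj(f_student)
        if f_aligned.shape[-2:] != f_teacher.shape[-2:]:
            # Resample to the teacher's spatial resolution for pixel-wise alignment.
            f_aligned = F.interpolate(f_aligned, size=f_teacher.shape[-2:],
                                      mode="bilinear", align_corners=False)
        return f_aligned
```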

Q2: Adaptation on other modalities

We appreciate the opportunity to further clarify the adaptability of our methods with the inclusion of additional modalities. Our methodology is designed to handle each modality symmetrically, ensuring that integrating new modalities does not necessitate alterations to the existing codebase, but rather requires configuration updates to accommodate them. Additionally, these new modalities should be integrated into the training of the new fusion teacher to extract the modality-general information from new modalities.

Once the teacher model is trained, the masked feature distillation operation can be applied to facilitate specific feature mask learning and selective feature distillation for the newly integrated student modality model. For the student model, augmenting the existing task loss with the KD loss is sufficient to complete its training.
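Concretely, the student's training objective would simply be its original task loss plus a weighted sum of the distillation terms. The sketch below is a schematic illustration only, with `kd_weight` as an assumed hyperparameter rather than a value from our paper:

```python
def student_total_loss(task_loss, kd_losses, kd_weight=1.0):
    """task_loss: the student's unchanged detection/segmentation loss.
    kd_losses: masked feature-distillation losses, one per distilled feature level."""
    return task_loss + kd_weight * sum(kd_losses)
```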

It is worth noting that our framework primarily conducts KD within the BEV feature space for LiDAR and camera students. However, BEV feature space is compatible with various modalities, including raw mmWave radar points[2, 3].

Incorporating additional modalities requires training a new fusion teacher and new student models, as well as adding data-processing operations for the new modalities. Completing this retraining process within the rebuttal period is challenging. We hope this clarification is helpful and addresses your concerns effectively.

References

[1] Huang, Tao, et al. "Masked distillation with receptive tokens." arXiv preprint arXiv:2205.14589 (2022).

[2] Stäcker, Lukas, et al. "RC-BEVFusion: A plug-in module for radar-camera bird’s eye view feature fusion." DAGM German Conference on Pattern Recognition. Cham: Springer Nature Switzerland, 2023

[3] Harley, Adam W., et al. "Simple-bev: What really matters for multi-sensor bev perception?." 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023.

Comment

Thanks for your explanation. That addressed my concerns.

Comment

We are very glad to have your concerns addressed. Thanks again for your time and suggestions.

Author Response (Global)

Thank you to all the reviewers for your constructive comments and feedback, which have been invaluable in allowing us to improve our work.

We appreciate the recognition from all reviewers of the main contributions of this paper, including the versatility of the proposed knowledge distillation (KD) framework, which is both modality- and task-agnostic. Additionally, the integration of cross-modal KD and fusion to enhance the efficacy of the teacher in the KD process has been acknowledged (GBji, uk3B, Vuwz). Furthermore, the use of the masked distillation module to create unique masks for different feature maps, thereby facilitating the thorough mining and transfer of useful information contained in the teacher's feature maps, has been positively noted (dGF9, uk3B, Vuwz).

In response to the constructive suggestions provided, we have adopted them and made the corresponding adjustments during the rebuttal period.

1. Addition of GFLOPs and Inference Speed Column to Table 1:

Thanks to the constructive suggestion raised by reviewer uk3B regarding the computational resources and time required for inference with different models, we utilized the open-source tool calflops to calculate the giga floating-point operations (GFLOPs) needed during inference by all models mentioned in the comparative experiments. We also measured and compared inference times on a commonly used RTX 4090 GPU and have included these results in the attached comparative experiments table. This addition more clearly illustrates the trade-off between accuracy and real-time performance that the proposed cross-modal KD brings to the student model.
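For reference, the measurement could look roughly like the sketch below. This is a simplified example for a model whose forward pass takes a single tensor, whereas the actual multi-modal detectors require their own input dictionaries; the calflops call follows the pattern documented for tensor-input models, so treat it as an assumed setup rather than our exact profiling script.

```python
import time
import torch
from calflops import calculate_flops  # open-source FLOPs counter used in the rebuttal

def profile(model, example_input, n_warmup=10, n_runs=50):
    """Report FLOPs (via calflops) and average GPU latency for a single-tensor-input model."""
    flops, macs, params = calculate_flops(model=model,
                                          input_shape=tuple(example_input.shape),
                                          output_as_string=True)
    model.eval().cuda()
    x = example_input.cuda()
    with torch.no_grad():
        for _ in range(n_warmup):          # warm up CUDA kernels
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_runs):
            model(x)
        torch.cuda.synchronize()
    latency_ms = (time.time() - start) / n_runs * 1000
    return flops, params, latency_ms
```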

2. Revision of the Related Work Section on Knowledge Distillation:

Thanks to the excellent papers mentioned by reviewer uk3B. We have thoroughly reviewed the recent literature and updated the related work section on knowledge distillation. The revised section now offers a more comprehensive overview of the latest developments in knowledge distillation, with a specific emphasis on cross-modal applications. This update ensures that our paper comprehensively reflects the current state of the field and its recent advancements, providing readers with a clearer understanding of the evolution and research gaps in cross-modal KD.

3. Clarification of Temporal KD and Additional Experimental Results on Temporal Camera Students:

In response to the comments regarding temporal KD, we would like to clarify the integration of temporal information. The teacher model used in our cross-modal KD inherently incorporates temporal information. Here is a more detailed analysis:

  • The LiDAR data pipeline naturally integrates multiple sweeps, embedding temporal information at the input level (a minimal schematic of this aggregation is given after this list). This means that even without explicit temporal operations, our multi-modal teacher model—and LiDAR-only student models like CenterPoint—already leverage temporal cues.

  • This is why influential work explicitly engaging in temporal fusion operations is predominantly focused on the camera branch, including BEVFormer[1], FB-BEV[2], PETRv2[3] and so on. To assess our framework's impact on multi-camera students employing explicit temporal fusion, we conducted supplementary experiments using BEVFormer as the student model on the nuScenes val set during the rebuttal period. The results in the attached Table 2 demonstrate that the implicit temporal information in the teacher model, along with modality-general information, enhances student model performance. These results are detailed in the attached PDF.

  • The field of BEV perception is rapidly evolving, with recent projects like BEVFusion4D[4] and FusionFormer[5] starting to explore explicit multi-modal temporal fusion. However, these projects are not yet open-sourced, complicating their use as foundational teacher models for guiding various student modalities. The field of multi-camera perception is introducing diverse explicit temporal fusion operations at different feature levels, such as BEV-based, proposal-based, and query-based methods. Exploring the commonalities among those temporal fusion approaches could be challenging and require further study. As multi-modal temporal fusion research advances and becomes open-sourced, integrating explicit temporal KD operations as pluggable components into the VexKD framework could represent a promising direction for future research.
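As referenced in the first bullet above, the sketch below schematically shows how standard multi-sweep LiDAR aggregation carries temporal information into the input: past sweeps are motion-compensated into the current ego frame and concatenated with a per-point time-lag channel. The field layout is an illustrative assumption reflecting common nuScenes practice, not a specific codebase.

```python
import numpy as np

def aggregate_sweeps(sweeps):
    """sweeps: list of (points, t_lag) pairs, where points is an (N, 4) array of
    [x, y, z, intensity] already transformed into the current ego frame, and
    t_lag is the time offset (in seconds) to the current keyframe.
    Returns an (M, 5) array whose last column carries the temporal information."""
    stamped = [np.concatenate([pts, np.full((pts.shape[0], 1), t_lag)], axis=1)
               for pts, t_lag in sweeps]
    return np.concatenate(stamped, axis=0)
```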

Please see the attached PDF with the modified tables and added experimental results, as well as the reviewer-specific rebuttals, for more information. Finally, we would like to extend our gratitude once again to all the reviewers for their valuable feedback and suggestions.

References

[1] Li, Zhiqi, et al. "Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers." European conference on computer vision. Cham: Springer Nature Switzerland, 2022.

[2] Li, Zhiqi, et al. "Fb-bev: Bev representation from forward-backward view transformations." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[3] Park, Jinhyung, et al. "Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection." The Eleventh International Conference on Learning Representations. 2022.

[4] Cai, Hongxiang, et al. "BEVFusion4D: Learning LiDAR-Camera Fusion Under Bird's-Eye-View via Cross-Modality Guidance and Temporal Aggregation." arXiv preprint arXiv:2303.17099 (2023).

[5] Hu, Chunyong, et al. "FusionFormer: A Multi-sensory Fusion in Bird's-Eye-View and Temporal Consistent Transformer for 3D Objection." arXiv preprint arXiv:2309.05257 (2023).

Comment

Dear Reviewers,

Thank you very much again for performing this extremely valuable service to the NeurIPS authors and organizers.

As the authors have provided detailed responses, it would be great if you could check them and see if your concerns have been addressed. Your prompt feedback would provide an opportunity for the authors to offer additional clarifications if needed.

Cheers,

AC

Final Decision

All reviewers found the studied problem to be important, the proposed method to be effective, and the results promising. The rebuttal successfully addressed reviewer comments and all reviewers recommend acceptance. The authors are encouraged to improve the final paper version by following reviewer recommendations.