DynamicBEV: Leveraging Dynamic Queries and Temporal Context for 3D Object Detection

Jiawei Yao, Yingxin Lai

OpenReview PDF

Submitted: 2023-09-16; Updated: 2024-03-26

Abstract

Keywords

3D object detection, Bird's Eye View

Reviews and Discussion

Review

Rating: 3; Confidence: 4; 2023-10-30

This paper proposes a lightweight and effective method for aggregating BEV pillar features using K-means clustering and Top-K Attention. The authors also introduce a Diversity Loss to prevent the attention mechanism from focusing too heavily on the most relevant features. The proposed method is evaluated on the nuScenes dataset and outperforms previous methods.
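To make the summarized mechanism concrete, here is a minimal sketch of clustering the features around one query with K-means and then attending only to the top-K cluster centroids. It is an illustration under stated assumptions, not the authors' implementation: the function names `kmeans` and `topk_attention`, the dot-product scoring, the feature dimension, and the values of `k` and `top_k` are all hypothetical.

```python
import math
import random

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's K-means; returns k centroids (lists of floats)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        groups = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            groups[dists.index(min(dists))].append(p)
        # Recompute each centroid as its group mean; keep it if the group is empty.
        for i, g in enumerate(groups):
            if g:
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def topk_attention(query, centroids, top_k):
    """Attend over the top_k centroids with the highest scaled dot-product score."""
    scale = math.sqrt(len(query))
    scores = [sum(q * c for q, c in zip(query, cen)) / scale for cen in centroids]
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax restricted to the kept centroids (shifted by the max for stability).
    m = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - m) for i in keep}
    z = sum(exps.values())
    weights = {i: e / z for i, e in exps.items()}
    # The aggregated feature is the weighted sum of the kept centroids.
    agg = [sum(weights[i] * centroids[i][d] for i in keep) for d in range(len(query))]
    return agg, weights

# Hypothetical toy data: 32 surrounding features of dimension 4 for one query.
rng = random.Random(1)
features = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(32)]
query = [0.5, -0.2, 0.1, 0.3]
centroids = kmeans(features, k=6)
aggregated, weights = topk_attention(query, centroids, top_k=3)
```

Restricting the softmax to a few centroids is what keeps the attention cost independent of the number of raw features; how the paper selects `k` and `top_k` is not specified in this review.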

Strengths

The proposed clustering and top-K attention mechanism are simple and intuitive, yet achieve strong performance compared to previous state-of-the-art methods (Table 1). Extensive ablation studies in Section 4.4 demonstrate the benefits of the proposed modules.

Weaknesses

Latency: Section 3.3 states that "the computational efficiency of DynamicBEV is one of its key advantages". However, not all floating-point operations (FLOPs) are created equal, especially for the clustering operation on GPUs, TPUs, and edge devices. It would be helpful if the authors could measure the latency of the full model and provide a breakdown of the latency of each component (e.g., clustering, top-k sorting).

Generalization: Evaluating the proposed method on only one dataset is not sufficient. I suggest evaluating the proposed method on at least one additional dataset.

Visualization: Could the authors provide detailed illustrations of K-means clustering and Top-K attention in Fig. 1? Fig. 2 is not clear. What does each color mean?
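The per-component latency breakdown requested under "Latency" could be collected along the following lines. This is a hedged sketch: `time_stage`, `fake_clustering`, and `fake_topk` are hypothetical stand-ins, not parts of DynamicBEV. On a GPU, kernels run asynchronously, so a real measurement would also need to synchronize the device before reading the clock (for example with `torch.cuda.synchronize()` in PyTorch).

```python
import statistics
import time

def time_stage(fn, *args, warmup=2, repeats=10):
    """Return the median wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)  # warm-up runs are discarded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

# Hypothetical stand-ins for the model components being profiled.
def fake_clustering(xs):
    return [x % 6 for x in xs]

def fake_topk(xs, k=3):
    return sorted(xs, reverse=True)[:k]

data = list(range(10000))
breakdown = {
    "clustering": time_stage(fake_clustering, data),
    "top_k": time_stage(fake_topk, data),
}
total = sum(breakdown.values())
for name, ms in breakdown.items():
    share = 100 * ms / total if total else 0.0
    print(f"{name}: {ms:.3f} ms ({share:.1f}%)")
```

Reporting the median over several repeats, after warm-up, reduces the influence of one-off stalls on the breakdown.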

Questions

Please see weaknesses.

Review

Rating: 5; Confidence: 3; 2023-10-31

The authors explore and analyze the existing query-based paradigm for 3D BEV-based object detection, and propose to adopt dynamic queries for temporal feature extraction. The experimental results on nuScenes show the effectiveness of the proposed method.

Strengths

The query-based paradigm for 3D BEV-based object detection is a popular and interesting topic in the 3D community. The authors propose to use dynamic queries for feature learning and make them work on nuScenes.

Weaknesses

Performance difference on large-scale vs. small-scale objects. It would be interesting if the authors could show the detailed detection performance for the 10 classes of nuScenes. From my understanding, the proposed method is somewhat sensitive to object size.
It is unclear to me how the size of the associated feature cluster and the number of queries are defined.
More quantitative/qualitative results. The manuscript reports detection numbers on the nuScenes validation set; however, the authors do not compare their method with recent SOTA methods on the test set. It would also be much more convincing if the authors presented some qualitative results or reported results on more public datasets, e.g., KITTI or Waymo.
I am curious about the inference time of the proposed method. The authors repeatedly claim that traditional temporal fusion is computationally heavy; however, to my understanding, the attention computation is also heavy.

Questions

Please refer to the Weaknesses section.

Review

Rating: 5; Confidence: 4; 2023-11-01

This paper presents dynamic queries for 3D object detection in bird's-eye view, distinguishing them from the static queries employed in SparseBEV. To enhance the model's performance, the authors introduce K-means clustering and Top-K Attention mechanisms, which facilitate the integration of global features into the queries. Additionally, the paper introduces a diversity loss to encourage queries to focus on all clustered features. Finally, a Lightweight Temporal Fusion Module is presented to speed up multi-frame fusion by using pre-computed features.
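The review does not give the form of the diversity loss. As an illustration only, one common way to discourage attention from collapsing onto a single cluster is an entropy-based regularizer; the sketch below assumes that form, and the name `negative_entropy` and the example weights are hypothetical. It should not be read as the paper's definition.

```python
import math

def negative_entropy(weights, eps=1e-12):
    """Negative Shannon entropy of an attention distribution.

    Minimizing this value pushes the weights toward a uniform spread
    over clusters; eps guards against log(0).
    """
    return sum(w * math.log(w + eps) for w in weights)

# Assumed example distributions over four clusters.
peaked = [0.97, 0.01, 0.01, 0.01]   # attention focused on one cluster
uniform = [0.25, 0.25, 0.25, 0.25]  # attention spread over all clusters

loss_peaked = negative_entropy(peaked)
loss_uniform = negative_entropy(uniform)
```

Under this assumed form, the uniform distribution receives the lower loss, which matches the stated goal of encouraging queries to attend to all clustered features.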

Strengths

I would like to compliment you on the clear and concise language used throughout the manuscript, as well as the well-designed figures and tables. These elements greatly enhance the readability and understandability of the paper.
It is clever to use clustering attention to reduce the computation cost of global attention.

Weaknesses

The proposed dynamic queries are not new. Prior research, such as CMT [1] and UVTR [2], has already demonstrated the adjustment of queries in each decoder layer. Moreover, CMT employs global attention, while UVTR utilizes local attention to update query features, raising questions about the novelty of the proposed dynamic queries.
There are some experimental omissions that limit the comprehensiveness of the evaluation. Notably, crucial comparisons are missing, such as latency comparisons with SparseBEV and an analysis of the performance-to-latency trade-off of clustering attention versus the global attention mechanism used in CMT.

[1] Cross Modal Transformer: Towards Fast and Robust 3D Object Detection, in ICCV 2023.
[2] Unifying Voxel-based Representation with Transformer for 3D Object Detection, in NeurIPS 2022.

Questions

Please see the Weaknesses.

Review

Rating: 5; Confidence: 4; 2023-11-03

The authors introduce a new paradigm in DynamicBEV, a novel approach that employs dynamic queries for BEV-based 3D object detection. The proposed dynamic queries exploit K-means clustering and Top-K Attention creatively to aggregate information more effectively from both local and distant features, which enables DynamicBEV to adapt iteratively to complex scenes. LTFM is designed for efficient temporal context integration with a significant computation reduction.

Strengths

K-means clustering determines how pillars fit into localized patterns and features in 3D space, facilitating a detailed understanding of the characteristics of the object.
The diversity loss ensures that the model does not focus excessively on dominant features and promotes balanced attention across the feature clusters.
LTFM is computationally efficient: it manages temporal context by reusing existing computations, avoiding resource-intensive operations.
DynamicBEV outperforms SOTA methods on the nuScenes validation set.

Weaknesses

The nuScenes test results are missing, and the paper is difficult to understand and lacks the necessary visualizations.
What do "surrounding features" mean, and can the authors explain how the surrounding features F of each query are divided into K clusters C1, ..., CK? Why use K-means to cluster the features? Also, Fig. 3(a) does not show a large gap between K = 5, 6, and 7. What is the aggregation based on? Is it the distance between features?
Why use top-k attention? If the authors want local information, deformable attention is an option.
The authors use an Iterative Update that repeats the K-means clustering and Top-K Attention steps, so I think they should report the inference speed.
For LTFM, the authors should compare with StreamPETR for a fair comparison.

Minor: Fig. 2 shows the difference between static-query and dynamic-query-based methods but lacks a detailed explanation. (The same applies to Fig. 1.)

Questions

See Weaknesses.