PaperHub

Rating: 5.3 / 10 · Poster · 4 reviewers (scores 5, 6, 4, 6; min 4, max 6, std 0.8)
Confidence: 3.8 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 2.8
NeurIPS 2024

LION: Linear Group RNN for 3D Object Detection in Point Clouds

Submitted: 2024-04-27 · Updated: 2024-11-06

Keywords: 3D Object Detection · Linear RNN

Reviews and Discussion

Official Review (Rating: 5)

This paper proposes to leverage linear networks such as RWKV and Mamba to capture long-range dependencies in LiDAR-based outdoor 3D object detection, leading to relatively larger group sizes of the voxel partition. The proposed techniques include voxel merging/expanding and voxel generation. Experiments are conducted on the Waymo Open dataset and nuScenes dataset, achieving state-of-the-art performance.

Strengths

  • This paper is an early attempt to utilize linear RNNs for outdoor LiDAR-based 3D object detection.
  • LION achieves state-of-the-art performance on the mainstream datasets.
  • LION can be built upon multiple linear RNNs such as Mamba, RetNet, and RWKV, showcasing the universality of the proposed framework.

Weaknesses

  • Some claims are obscure and not well supported by evidence, such as: "However, effectively applying linear group RNN to 3D object detection in highly sparse point clouds is not trivial due to its limitation in handling spatial modeling." Why is it not trivial? What is the limitation? What is spatial modeling?
  • The proposed techniques are not novel. The spatial descriptor is a common spconv-based module. Voxel merging and expanding are trivial. The voxel generation is new, but also similar to some existing methods, such as the "virtual voxel" in "FSDv2: Improving Fully Sparse 3D Object Detection with Virtual Voxels". It would be better to include a discussion.
  • The ablation for larger group size is not sufficient. The authors should conduct experiments with different group sizes to show how group size matters, since it is a main claim.
  • Since the most significant advantage of utilizing linear RNNs is efficiency, the authors are encouraged to conduct a more detailed runtime evaluation that reveals the latency of each component and how the designs affect efficiency.

Questions

  • It seems like the multiple diffused voxels can occupy a single position. How do you handle this situation?
  • How many new voxels will be generated? Does the voxel generation reduce the efficiency?

Limitations

The authors adequately addressed the limitations.

Author Response

W1: Some claims are obscure and not well supported by evidence ...

Before feeding voxel features into the linear RNN, we need to flatten the 3D voxel features into 1D sequence features. Unlike the common 3D sparse convolution, which operates directly on 3D voxel features in 3D space, the linear RNN processes 3D voxels only in 1D space, which limits its spatial modeling. Here, the limitation is that the linear RNN is a sequence model, which is less effective at perceiving 3D spatial information (e.g., two adjacent voxels). Spatial modeling means capturing the local 3D geometric information around each voxel.
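To make this concrete, below is a minimal illustrative sketch (ours, not the authors' code; the raster ordering and window extent are assumptions) of how flattening sparse voxels into a 1D sequence can separate 3D neighbors:

```python
# Hypothetical serialization of sparse voxels for a linear RNN.
import numpy as np

coords = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])  # (N, 3) voxel indices (x, y, z)

grid = (64, 64, 8)  # assumed window extent
keys = coords[:, 0] * grid[1] * grid[2] + coords[:, 1] * grid[2] + coords[:, 2]
order = np.argsort(keys)
sequence = coords[order]  # the 1D order in which the linear RNN visits voxels

# (0, 0, 0) and (1, 0, 0) are adjacent in 3D, but their keys (0 and 512) can be
# separated by hundreds of other voxels in the flattened sequence -- the
# "spatial modeling" limitation described above.
```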

W2: The proposed techniques are not novel.

  • Although the 3D spatial descriptor consists of a common spconv-based module, it is crucial for helping our linear RNN-based network better capture local 3D spatial information. The combination of the 3D spatial descriptor and linear RNN operators lets each voxel feature perceive both local spatial information and long-range relations, which is important for improving detection performance.
  • The voxel merging and expanding operations are applied to reduce the computation cost without harming detection performance. But we do not claim these operations as our contributions (refer to L63-71 of the main paper).
  • For voxel generation, we make the first attempt to leverage the auto-regressive property of linear RNNs (refer to Table 5 in the main paper) to generate new voxel features for 3D detection. This is different from the mentioned method FSDv2. Specifically, FSDv2 first votes for center points and converts these points to virtual voxels; for the virtual voxel features, FSDv2 aggregates point features by an MLP. Therefore, we categorize it as a KNN manner of voxel diffusion (refer to the experiments in Table 5 of the main paper). We will add this discussion to the revised paper.
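As a rough sketch of the diffusion step described above (our illustration: the function name, the magnitude-based foreground criterion, and the exact diagonal offsets are assumptions rather than the authors' exact design):

```python
# Hypothetical voxel diffusion: propose new BEV positions around likely
# foreground voxels; their features start empty and are later generated by
# the linear RNN's auto-regressive pass, not copied from neighbors.
import numpy as np

def diffuse_voxels(coords, feats, ratio=0.2):
    """coords: (N, 2) BEV voxel indices; feats: (N, C) float voxel features."""
    k = max(1, int(ratio * len(coords)))
    top = np.argsort(-np.linalg.norm(feats, axis=1))[:k]  # unsupervised foreground proxy

    offsets = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])  # diagonal BEV offsets
    new_coords = (coords[top][:, None, :] + offsets[None]).reshape(-1, 2)
    new_feats = np.zeros((len(new_coords), feats.shape[1]), feats.dtype)  # to be generated
    return new_coords, new_feats
```

The contrast with a KNN-style scheme such as FSDv2 is the last line: the new positions carry no aggregated neighbor features; the sequence model fills them in auto-regressively.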

W3: Ablation for larger group size.

Thanks. We provide ablation studies of different group sizes in the following table. Here, we set a minimum group size of 256 for all four LION blocks (baseline I: [256, 256, 256, 256]). We observe that the settings with larger group sizes (II, III, IV, V) bring consistent performance improvements over the baseline (I). However, performance drops from IV to V when the group size is enlarged further. This might be due to less effective retention of important information in excessively long sequences, owing to the limited memory capacity of linear RNNs.

| #   | Group Size               | Vehicle   | Pedestrian | Cyclist   | mAP/mAPH (L2) |
|-----|--------------------------|-----------|------------|-----------|---------------|
| I   | [256, 256, 256, 256]     | 65.6/65.2 | 72.3/65.0  | 68.3/67.2 | 68.8/65.8     |
| II  | [1024, 512, 256, 256]    | 66.9/66.5 | 74.9/69.6  | 70.8/69.8 | 70.9/68.6     |
| III | [2048, 1024, 512, 256]   | 66.7/66.3 | 74.9/69.7  | 72.2/71.2 | 71.3/69.1     |
| IV  | [4096, 2048, 1024, 512]  | 67.0/66.6 | 75.4/70.2  | 71.9/71.0 | 71.4/69.3     |
| V   | [8192, 4096, 2048, 1024] | 66.5/66.1 | 74.6/69.5  | 71.6/70.6 | 70.9/68.7     |

W4: More detailed runtime evaluation.

We provide the detailed latency of each part of our LION-Mamba. Since linear RNN-based operators (Mamba, RetNet, or RWKV) are usually adopted to replace Transformers in modeling long sequences thanks to their efficiency, we compare the latency against a Transformer-based backbone (DSVT) to illustrate the efficiency of LION. We evaluate the latency on one NVIDIA GeForce RTX 3090 with a batch size of 1. Due to the quadratic complexity of the Transformer, DSVT adopts a small group size of 48 in its paper. When we increase the group size of DSVT to 256, the latency increases markedly and the memory becomes unacceptable (about 20 GB of GPU memory for inference, and training fails due to OOM). In contrast, benefiting from the high efficiency of the linear RNN in modeling long sequences, our LION can adopt a larger group size (4096) for feature interaction while maintaining acceptable latency (146.2 ms) and low GPU memory (about 3 GB during inference).

| Method                | Voxel Extraction (ms) | 3D Backbone (ms) | BEV Backbone (ms) | Detection Head (ms) | Total Latency (ms) | mAPH (L2) |
|-----------------------|-----------------------|------------------|-------------------|---------------------|--------------------|-----------|
| DSVT (official paper) | 7.5                   | 82.8             | 28.6              | 17.8                | 136.7              | 72.1      |
| DSVT (256 group)      | 7.3                   | 164.4            | 28.6              | 17.8                | 218.1              | OOM       |
| LION-Mamba            | 4.3                   | 97.1             | 27.1              | 17.7                | 146.2              | 73.2      |
| LION-Mamba-L          | 5.6                   | 136.0            | 32.2              | 17.5                | 191.3              | 74.0      |
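For context, per-component latencies like these are normally collected with warmed-up, CUDA-synchronized timing; below is a minimal sketch of that standard measurement loop (our illustration, not the authors' script):

```python
# Standard GPU latency measurement with warm-up and explicit synchronization,
# so asynchronous kernel launches are timed in full.
import time
import torch

def measure_latency_ms(model, sample, n_warmup=10, n_iters=50):
    model.eval().cuda()
    with torch.no_grad():
        for _ in range(n_warmup):       # warm-up: autotuning, allocator caches
            model(sample)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(sample)
        torch.cuda.synchronize()        # wait for all queued kernels to finish
    return (time.perf_counter() - start) * 1000.0 / n_iters
```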

Furthermore, we provide a detailed ablation study on the Waymo validation set with 20% training data to analyze the effect of each component on the efficiency and performance.

| 3D Spatial Descriptor | Voxel Generation | Latency (ms) | mAPH (L2) |
|-----------------------|------------------|--------------|-----------|
|                       |                  | 123.2        | 65.8      |
| ✓                     |                  | 131.3        | 68.6      |
| ✓                     | ✓                | 146.2        | 69.3      |

Q1: It seems like the multiple diffused voxels can occupy a single position.

We apologize for the unclear presentation. In the voxel merging operation, we merge the diffused voxels and raw voxels by summing the voxels at the same position, which deduplicates them. We will make this clearer in the revised version.
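A minimal sketch of this merge-by-sum deduplication (our illustration; the names are ours):

```python
# Sum features of voxels that share a coordinate (raw + diffused), keeping
# one voxel per unique position.
import numpy as np

def merge_by_sum(coords, feats):
    """coords: (N, 3) int voxel indices; feats: (N, C) float features."""
    uniq, inv = np.unique(coords, axis=0, return_inverse=True)
    merged = np.zeros((len(uniq), feats.shape[1]), feats.dtype)
    np.add.at(merged, inv, feats)  # scatter-add: sum features at the same position
    return uniq, merged
```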

Q2: How many new voxels will be generated? Does the voxel generation reduce the efficiency?

The number of newly generated voxels depends on the number of input voxels and the diffusion ratio. The voxel generation does reduce the efficiency of the whole network. To better illustrate its effect, we provide an ablation study on the key hyper-parameter, the diffusion ratio, in the following table. We find that a larger diffusion ratio brings more latency but better performance. In this paper, we set the diffusion ratio to 0.2 to trade off performance and latency.

| Ratio | Latency (ms) | Vehicle   | Pedestrian | Cyclist   | mAP/mAPH (L2) |
|-------|--------------|-----------|------------|-----------|---------------|
| 0     | 131.3        | 66.5/66.1 | 74.8/69.6  | 70.9/70.0 | 70.8/68.6     |
| 0.1   | 141.0        | 66.9/66.5 | 75.0/69.8  | 71.5/70.5 | 71.1/68.9     |
| 0.2   | 146.2        | 67.0/66.6 | 75.4/70.2  | 71.9/71.0 | 71.4/69.3     |
| 0.5   | 158.9        | 67.2/66.8 | 75.3/70.0  | 72.1/71.1 | 71.5/69.3     |
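As a rough back-of-the-envelope reading of the table (our arithmetic, assuming the four diagonal BEV offsets mentioned elsewhere in this discussion): if a fraction r of the N input voxels is diffused with four offsets each, at most 4·r·N new voxels are proposed before deduplication; e.g., r = 0.2 gives at most 0.8N, consistent with the moderate latency growth above.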
Comment

Dear Reviewer,

Thank you very much for your valuable reviews and comments. We have carefully addressed the concerns you raised and hope that our responses satisfactorily resolve them.

If you have any questions or need additional clarification after reading our rebuttal, please do not hesitate to let us know during the discussion period.

Thank you once again for your time and consideration.

Comment

Your response addresses many of my concerns. I'd like to know whether you plan to release the full code. Since your method achieves superior performance, it will be hard for follow-up work to build on or compare with your method if the code is not available.

Comment

Sincerely thank you for your comments! We promise to release all code and models for the community by September 30, 2024. Would you consider raising your score if we have addressed your concerns? We look forward to your response and feedback. Thank you very much!

Comment

Thanks for your willingness to open-source the code, which will be valuable for the community. I'd like to increase my score. The authors are strongly encouraged to add the discussion to the revision. I am also curious about the comparison between LION and a concurrent preprint on a voxel-based Mamba (https://arxiv.org/abs/2406.10700). I believe a discussion in the revision will make their differences clearer and improve the quality of the manuscript. Good luck.

Comment

Thank you for this reference (made public on arXiv on 15 Jun 2024, after the NeurIPS submission deadline), which is also a good paper. We will cite it and add a discussion in our related work in the final version. Thanks again for your valuable suggestion!

Official Review (Rating: 6)

This paper proposes a linear group RNN-based backbone for 3D object detection. It can achieve a larger window size than previous Transformer-based methods. A 3D spatial feature descriptor is also introduced to capture 3D spatial information. Furthermore, to address the sparsity of point clouds, the paper leverages the auto-regressive property of RNNs to generate voxels that help distinguish foreground features. Experiments on the large-scale nuScenes and Waymo datasets validate the effectiveness, as well as the generalization across different linear group RNN operators such as Mamba, RWKV, and RetNet.

Strengths

  1. While Transformer-based backbone networks have demonstrated superior performance, their quadratic complexity has limited their application scenarios. This paper explores the potential of linear group RNNs as feature-extraction backbones for 3D detection tasks and presents state-of-the-art performance on large-scale datasets, which is interesting.
  2. The proposed method is shown to be effective across multiple linear group RNN operators such as Mamba, RWKV, and RetNet, demonstrating its generalization ability.
  3. The paper is well written, with a clear expression of the motivation, and is easy to read.

Weaknesses

1. In this paper, the authors transform the irregular point cloud into a regular voxel representation. However, in L172-175 the authors claim that max- or average-pooling operations are not suitable for downsampling or upsampling. This seems contradictory. Furthermore, I think the motivation provided for the voxel generation approach is not sufficient, making it difficult to accept the motivation behind the proposed voxel generation method.

2. The authors should report test set results on the Waymo dataset. In addition, the authors should report results under the multi-frame setting on Waymo to prove the effectiveness of the method there. I want to know the Waymo test set results and the multi-frame results.

3. The authors should report the inference time of LION so that we have a clearer understanding of the latency and resource consumption of the proposed method.

4. When will the code be released for the community?

Questions

  1. L124: I'm not clear about the window partition. As far as I understand, non-empty voxels are extracted line by line into the window along the X or Y axis; every time a sequence of length 4096 is filled, the remaining non-empty voxels are placed into the next window. So how do you deal with a final window that has fewer than 4096 voxels? The division and description of windows should be clearer; the current description is difficult for me to understand.

For other questions, please see the weaknesses. I will raise the score after my concerns are resolved.

Limitations

The authors state the corresponding limitations in the paper.

Author Response

W1-1: In this paper, the authors transform the irregular point cloud into a regular voxel representation. However, in L172-175 the authors claim max- or average-pooling operations are not suitable for downsampling or upsampling. This seems contradictory.

Sorry for the confusion! Since the distribution of voxels in 3D space is sparse, the regular max- or average-pooling operations used on dense 2D images are not suitable for voxel downsampling or upsampling.
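A minimal sketch of the sparse alternative (our illustration; the max reduction and names are assumptions, not the paper's exact operator): voxels are grouped by their coarser parent cell and pooled only within the non-empty groups:

```python
# Downsample sparse voxels by grouping integer coordinates, instead of
# sliding a dense pooling window over mostly-empty space.
import numpy as np

def sparse_downsample(coords, feats, stride=2):
    """coords: (N, 3) int voxel indices; feats: (N, C) float features."""
    coarse = coords // stride                        # parent cell of each voxel
    uniq, inv = np.unique(coarse, axis=0, return_inverse=True)
    pooled = np.full((len(uniq), feats.shape[1]), -np.inf, feats.dtype)
    np.maximum.at(pooled, inv, feats)                # max-pool within each parent cell
    return uniq, pooled
```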

W1-2: Furthermore, I think the motivation provided for the voxel generation approach is not sufficient, making it difficult to accept the motivation behind the proposed voxel generation method.

Our motivation for voxel generation is twofold: 1) voxel generation can densify key voxel features to enhance the feature representation in highly sparse point clouds; 2) voxel generation can mitigate the information loss from the voxel merging operation, which is an effective way to reduce the computation cost in our LION. We will make this clearer in the revised version.

W2: The authors should report test set results on the Waymo dataset, as well as results under the multi-frame setting, to prove the effectiveness of the method.

Thanks for your nice suggestion. We provide results with 3 frames on the val set in the following table. Our LION-Mamba-L even outperforms DSVT by 2.2 mAPH (L2), which effectively illustrates the superiority of LION.

| Method       | Frames | Vehicle   | Pedestrian | Cyclist   | mAP/mAPH (L2) |
|--------------|--------|-----------|------------|-----------|---------------|
| SST          | 3      | 68.5/68.1 | 75.1/70.9  | -/-       | -/-           |
| DSVT         | 3      | 73.6/73.2 | 78.2/75.4  | 77.2/76.4 | 76.3/75.0     |
| LION-Mamba-L | 3      | 73.9/73.5 | 81.0/78.3  | 80.7/79.8 | 78.5/77.2     |

Furthermore, when submitting to the Waymo test benchmark, it is common practice to combine the training and validation sets to train the model for better performance. Therefore, we needed to reorganize our dataset and re-train our model for the test set. Considering the limited time, we train LION-Mamba-L with 3 frames for only 12 epochs to save training time. LION-Mamba-L achieves state-of-the-art performance on the Waymo test set; the results are as follows (we also provide a screenshot of the 3-frame results on the official Waymo website in the uploaded global PDF file):

| Method        | Epochs | Frames | Vehicle   | Pedestrian | Cyclist   | mAP/mAPH (L2) |
|---------------|--------|--------|-----------|------------|-----------|---------------|
| CenterPoint++ | 30     | 3      | 75.5/75.1 | 75.1/72.4  | 72.0/71.0 | 74.2/72.8     |
| PillarNeXt    | 36     | 3      | 76.2/75.8 | 78.8/76.0  | 71.6/70.6 | 75.5/74.1     |
| LION-Mamba-L  | 12     | 3      | 77.2/76.9 | 82.0/79.3  | 76.8/75.9 | 78.7/77.4     |

W3: The authors should report the inference time of LION so that we have a clearer understanding of the latency and resource consumption of the proposed method.

Thanks for your suggestion. We provide the detailed inference latency of LION-Mamba and LION-Mamba-L in the following table. We evaluate the latency on one NVIDIA GeForce RTX 3090 with a batch size of 1. For a more detailed discussion of the Transformer-based method DSVT in terms of latency and performance, please refer to W4 of Reviewer 9fJS.

| Method       | Voxel Extraction (ms) | 3D Backbone (ms) | BEV Backbone (ms) | Detection Head (ms) | Total Latency (ms) | mAPH (L2) |
|--------------|-----------------------|------------------|-------------------|---------------------|--------------------|-----------|
| LION-Mamba   | 4.3                   | 97.1             | 27.1              | 17.7                | 146.2              | 73.2      |
| LION-Mamba-L | 5.6                   | 136.0            | 32.2              | 17.5                | 191.3              | 74.0      |

W4: When will the code be released for the community?

We will release all code and models by September 30, 2024.

Q1: L124, I'm not clear about the Window Partition ... how do you deal with it if the final window does not have enough voxels to fill 4096?

We apologize for the unclear presentation. For a final window with fewer than 4096 voxels, we repeat the remaining voxels in that window up to 4096 voxels. We will make this clearer in the revised version.
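A minimal sketch of this repeat-to-fill padding (our illustration; the tiling scheme is an assumption):

```python
# Pad the final, under-filled window by repeating its voxel indices until the
# group size is reached; assumes the window is non-empty.
import numpy as np

def pad_window(indices, group_size=4096):
    """indices: 1D array of voxel indices assigned to the final window."""
    if len(indices) >= group_size:
        return indices[:group_size]
    reps = int(np.ceil(group_size / len(indices)))
    return np.tile(indices, reps)[:group_size]
```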

Comment

Thanks for the authors' rebuttal. My concerns have been addressed and I will raise my score. Releasing the code will be great for the community.

Comment

Sincerely thank you for your valuable comments again! We will open-source all the code for the community.

Official Review (Rating: 4)

This paper targets the problem of long-range feature interaction for point cloud detection. It proposes a window-based 3D backbone built on linear group RNNs and sparse convolution. In contrast to existing Transformer methods, this work increases the group size by leveraging the linear complexity of the recent Mamba and RWKV. The paper also introduces a 3D spatial feature descriptor to capture local 3D spatial information and a 3D voxel generation strategy to address the sparsity of point clouds. Experiments demonstrate the efficacy of the proposed method.

Strengths

• The problem studied in this paper is important, as the long-range relationship is critical in point cloud detection.

• LION-Mamba achieves strong performance with low GFLOPs on widely used outdoor datasets, including the Waymo Open Dataset and nuScenes.

• The proposed 3D spatial feature descriptor and voxel generation are simple and easy to follow.

Weaknesses

• [Novelty] My main concern is the overall limited technical contribution. The paper does not show significant differences from previous works, and it lacks a thorough discussion comparing it with existing approaches.
    (1) Model structure. The proposed LION block uses the same encoder-decoder structure as the SED block in HEDNet [1]. Furthermore, the 3D spatial feature descriptor is identical to the SSR block, and the voxel merging and expanding operations are similar to RS conv. Besides, the LION layer has the same structure as the DSVT block. This paper appears to merely integrate the DSVT block into the SED block and replace Transformers with linear RNNs.
    (2) Window partition. The equal-size window partition along the X/Y axis has been widely adopted in voxel-based detectors, such as FlatFormer and DSVT.
    (3) Voxel generation. The way of "distinguishing foreground voxels without supervision" has been widely used in point cloud detection, such as SPSS-Conv [2] and VoxelNeXt [3].
• [Latency] While there is some analysis of computation cost (GFLOPs), the paper lacks a latency comparison with state-of-the-art methods such as HEDNet and DSVT. For 3D object detection in outdoor scenarios like autonomous driving, real-time application makes latency all the more critical. Besides, in Figure 1, the comparison with LION-Mamba-L is missing.
• In lines 42-44, the authors claim LION can support thousands of voxel features to establish long-range relationships. However, in lines 153-157, the paper raises the spatial-information-loss issue and needs an additional local sparse convolution to address it. In my view, a sequence of thousands of voxels based on window partitions must include each voxel and all its neighbors. This inconsistency raises concerns about the robustness and effectiveness of LION.
• The linear RNN, such as Mamba, is a unidirectional model. Is it reasonable to use two single-layer RNNs in a LION layer? Would a bidirectional Mamba, such as Vision Mamba, enhance performance?
• [Motivation] The X/Y-axis window partition is proposed to address the feature-interaction problem under group-size limitations. Given the large group size, the motivation for using the X/Y-axis window partition is unclear.
• [Motivation] The motivation for choosing only four offsets in voxel generation with the auto-regressive property is unclear. A voxel has many neighbors in 3D space, so why did the authors choose only these four offsets? Further explanation is needed.

[1] HEDNet: A Hierarchical Encoder-Decoder Network for 3D Object Detection in Point Clouds. NeurIPS 2023.
[2] Spatial Pruned Sparse Convolution for Efficient 3D Object Detection. NeurIPS 2022.
[3] VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking. CVPR 2023.

Questions

Please see the weaknesses section. Additionally:
• Please provide a comparison with HEDNet on Waymo, including latency, accuracy, and parameters.
• It would be better to illustrate visualization results to support the claim that LION can model long-range dependencies.
• What is the performance of replacing the LION layer with the DSVT block (using the setting in the original paper) in the LION block? Moreover, please provide a latency comparison between the two variants.

Limitations

Limitations have been included

Author Response

Thanks for your careful review and valuable suggestions. Here, we first restate our contributions in this paper:

  1. Linear RNN-based 3D detection framework: supports various linear RNNs (e.g., Mamba, RWKV, RetNet) to allow long-range feature interaction.

  2. 3D spatial feature descriptor: compensates for linear RNNs' weakness in capturing local 3D spatial information.

  3. Voxel generation: utilizes the auto-regressive property of the linear RNN for voxel diffusion, obtaining a more discriminative feature representation in highly sparse point clouds.

[Novelty]

  1. Model Structure.

    • The encoder-decoder structure is not our contribution; we adopt it to reduce computational cost while maintaining superior performance. We provide the corresponding experiments in the following table.

    | Method                                     | Latency (ms) | Vehicle   | Pedestrian | Cyclist   | mAP/mAPH (L2) |
    |--------------------------------------------|--------------|-----------|------------|-----------|---------------|
    | LION-Mamba (w/o encoder-decoder structure) | 180.1        | 66.8/66.4 | 75.1/70.2  | 72.1/71.1 | 71.3/69.2     |
    | LION-Mamba                                 | 146.2        | 67.0/66.6 | 75.4/70.2  | 71.9/71.0 | 71.4/69.3     |

    • Although the 3D spatial descriptor consists of a common and simple spconv-based module, it is crucial for helping our linear RNN-based network better capture local 3D spatial information. The combination of the 3D spatial descriptor and linear RNN operators lets each voxel feature perceive both local spatial information and long-range relations, which is important for improving detection performance (refer to Tables 3 and 4 in the main paper).

    • The structure of the LION layer is widely used (e.g., Swin Transformer, FlatFormer, and DSVT). We keep it consistent with DSVT for convenience.

  2. Window Partition.

    We follow the previous work FlatFormer (refer to L130-133 in the main paper). Besides, we do not claim the equal-size window partition as our contribution.

  3. Voxel Generation.

    Thanks for your valuable feedback. We would like to emphasize that our primary contribution in voxel generation lies in the first attempt to use the auto-regressive capacity of the linear RNN; the part on "distinguishing foreground voxels without supervision" only serves it. We appreciate your point regarding distinguishing foreground voxels without supervision, and we will revise this part in the final version.

[Latency]

Thanks. We provide a comparison of different methods below. We evaluate the latency on a single RTX 3090 with a batch size of 1. LION is an early attempt to apply linear RNNs to 3D object detection, and its running time still needs optimization compared with highly optimized operators (e.g., sparse conv); we will improve the speed in future work for real-time application (please refer to the limitations, L323-325, in the main paper). We will revise Figure 1 of the main paper in the final version. For a more detailed discussion of the Transformer-based method DSVT, please refer to W4 of Reviewer 9fJS.

| Method       | mAPH (L2) | Latency (ms) | Params (M) |
|--------------|-----------|--------------|------------|
| HEDNet       | 73.4      | 74.8         | 11.9       |
| DSVT         | 72.1      | 136.7        | 8.7        |
| LION-Mamba   | 73.2      | 146.2        | 8.6        |
| LION-Mamba-L | 74.0      | 191.3        | 16.1       |

[Motivation]

  1. Motivation to use X/Y-axis window partition.

    For linear RNN models, different sequence arrangements lead to varying feature interactions. More arrangements can improve feature richness, so we use X/Y-axis window partitioning to generate diverse sequences.

  2. Why did the author choose only these four offsets?

    Setting more offsets usually produces more voxels, which brings more computation cost for the following LION layers. To reduce computation cost, we only consider diffusion offsets for each voxel in BEV space and ignore the offset along the Z axis. Here, we simply choose four offsets along the two diagonals in BEV space.

[Weakness]

  1. A sequence of thousands based on window partitions must include the voxel and all its neighbors ...

    Although a sequence with thousands of voxels usually contains each voxel's neighbors, two spatially adjacent voxels might still be far apart in the sequence (please refer to Figure 4 in the main paper), since the linear RNN extracts features along the sequence. Therefore, we adopt an additional local sparse convolution (the 3D spatial feature descriptor) to address this problem.

  2. Bidirectional modeling.

    Sorry for missing this detail! In this paper, all linear RNNs (Mamba, RWKV, and RetNet) adopt a bidirectional manner for better feature interaction. We will add these details in the revised version.
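A minimal sketch of such a bidirectional scheme (our illustration; the shared weights and fusion by summation are assumptions, not necessarily the authors' exact design):

```python
# Run a linear RNN block over the sequence and its reversal, then re-align
# and fuse, so every voxel sees context from both directions.
import torch

def bidirectional_scan(rnn, x):
    """x: (B, L, C) flattened voxel features; rnn: any linear RNN block."""
    forward = rnn(x)
    backward = rnn(torch.flip(x, dims=[1]))           # scan the reversed sequence
    return forward + torch.flip(backward, dims=[1])   # flip back and fuse
```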

[Question]

  1. Comparison with HEDNet

    Thanks. We provide the comparison with HEDNet in [Latency].

  2. Visualization of long-range dependencies.

    Thanks. We provide the visualization for long-range dependencies in Figure 1 in our uploaded PDF file.

  3. Performance of replacing the LION Layer with DSVT block.

    We replace the LION layer with the DSVT block to validate the effectiveness of LION, as shown in the following table. * denotes results without our 3D spatial feature descriptor and voxel generation. All models are trained on 20% of the data for 12 epochs, and we report the official results of DSVT (I). We observe that integrating DSVT into LION (II and IV) performs worse than LION (III and V). Moreover, the variants with our 3D spatial feature descriptor and voxel generation (IV and V) perform much better than those without (II and III), which effectively demonstrates the importance of our proposed components.

    | #   | Method      | Latency (ms) | Vehicle   | Pedestrian | Cyclist   | mAP/mAPH (L2) |
    |-----|-------------|--------------|-----------|------------|-----------|---------------|
    | I   | DSVT        | 136.7        | 67.2/66.8 | 72.5/66.4  | 70.1/69.1 | 69.9/67.4     |
    | II  | LION-DSVT*  | 122.7        | 64.8/64.3 | 71.0/63.6  | 67.4/66.2 | 67.7/64.7     |
    | III | LION-Mamba* | 123.2        | 66.2/65.7 | 73.7/67.2  | 68.7/67.6 | 69.5/66.9     |
    | IV  | LION-DSVT   | 157.7        | 66.1/65.7 | 74.4/68.8  | 70.7/69.8 | 70.4/68.1     |
    | V   | LION-Mamba  | 146.2        | 67.0/66.6 | 75.4/70.2  | 71.9/71.0 | 71.4/69.3     |
Comment

Dear Reviewer,

Thank you very much for your valuable reviews and comments. We have carefully addressed the concerns you raised and hope that our responses satisfactorily resolve them.

If you have any questions or need additional clarification after reading our rebuttal, please do not hesitate to let us know during the discussion period.

Thank you once again for your time and consideration.

Comment

Thanks for your detailed reply. I appreciate the clarifications, which have addressed some of my initial concerns. However, after careful consideration of your responses and the manuscript, I still have significant reservations about this paper.

• Motivation needs to be improved. Following previous Transformer-based frameworks, using linear RNNs in 3D detection is a straightforward and simple matter. The support for linear RNNs is hardly inspiring for the 3D detection community.

• The contributions and novelty are limited in this field. Per the response, techniques like the encoder-decoder (HEDNet), the feature descriptor (SubmConv), and the window partition (FlatFormer) are already widely used in 3D detection. Moreover, increasing voxel density in 3D space has proven useful in prior work.

• Low efficiency. LION shares the same structure as HEDNet, yet its latency is nearly three times higher. The increased computational complexity offers minimal performance gains (only +0.6 L2 mAPH). I have serious concerns about using linear RNNs for outdoor 3D detection that demands real-time performance. Moreover, the authors have not addressed this critical issue but instead added voxel generation, which further increases the computational burden.

I suggest the authors focus on improving the efficiency of applying linear RNNs in 3D detection, rather than merely applying various types. This would be more valuable to the community. Given these considerations, I maintain my original rating.

Comment

Thank you for your patient and detailed comments. We clarify these points further below.

  1. Motivation needs to be improved.

    • The linear RNN is an important operator for supporting larger group sizes in 3D object detection, since the long-range relationship is critical in point cloud detection.

    • In fact, directly applying a linear RNN (keeping the same structure except for our proposed 3D spatial feature descriptor and voxel generation) achieves only poor performance (66.9 mAPH/L2). Adopting the proposed 3D spatial feature descriptor and voxel generation yields a 2.4 mAPH/L2 improvement, which demonstrates the effectiveness of our contribution. * denotes results without our 3D spatial feature descriptor and voxel generation.

    | #  | Method      | Latency (ms) | Vehicle   | Pedestrian | Cyclist   | mAP/mAPH (L2) |
    |----|-------------|--------------|-----------|------------|-----------|---------------|
    | I  | LION-Mamba* | 123.2        | 66.2/65.7 | 73.7/67.2  | 68.7/67.6 | 69.5/66.9     |
    | II | LION-Mamba  | 146.2        | 67.0/66.6 | 75.4/70.2  | 71.9/71.0 | 71.4/69.3     |
  2. The contributions and novelty are limited to this field.

    • The encoder-decoder and window partition are not our contributions.
    • Although the 3D spatial descriptor is a simple SubmConv, it is crucial for our linear RNN-based network to better capture local 3D spatial information, which is central to making good use of linear RNNs.
    • Increasing voxel density in 3D space is a common problem. We would like to emphasize that our primary contribution in voxel generation lies in the first attempt to use the auto-regressive capacity of the linear RNN, achieving better performance than other methods.
  3. Performance and Efficiency

    Performance: The performance on Waymo is relatively saturated; we consider 0.6 mAPH/L2 a relatively large improvement. We provide nuScenes results in the main paper, and we additionally provide results on the Argoverse V2 dataset in this response.

    • nuScenes: LION significantly outperforms HEDNet by 1.9 NDS. Besides, the latency gap between LION and HEDNet is narrower than on Waymo.

      | Method     | Latency (ms) | NDS         | mAP         |
      |------------|--------------|-------------|-------------|
      | HEDNet     | 162.5        | 72.0        | 67.7        |
      | LION-Mamba | 183.8        | 73.9 (+1.9) | 69.8 (+2.1) |
    • Argoverse V2: LION significantly outperforms HEDNet by 4.4 mAP, setting a new SOTA. Besides, LION is faster than HEDNet in the large-range scenario (200m × 200m), which demonstrates the effectiveness of LION in processing point clouds with large group sizes. We will add this result in the revised version.

      | Method     | Latency (ms) | mAP         |
      |------------|--------------|-------------|
      | HEDNet     | 192.3        | 37.1        |
      | LION-Mamba | 186.6        | 41.5 (+4.4) |

    Efficiency: LION is an early attempt to apply linear RNNs to 3D object detection, achieving SOTA performance on Waymo, nuScenes, and Argoverse V2. Its running time still needs engineering optimization compared with highly optimized operators (e.g., sparse conv). We will improve the speed in future work for real-time application.

Comment

Thank you for providing additional clarification in your response.

• Extending the group size in LION, though potentially useful, appears to be a straightforward application of existing techniques rather than a novel insight for this field. In summary, I think the contribution of the paper does not reach the bar of NeurIPS.
• The improvements from the 3D spatial feature descriptor and voxel generation are more pronounced in LION-DSVT than in LION-Mamba (Question 3 of the authors' rebuttal). This indicates that the enhancements are not uniquely tied to linear RNNs, which seems to contradict the claims made in the paper.
• The experimental results show that the latency of HEDNet on nuScenes is 2.17 times that on Waymo, different from our findings (about 1.4 times). While we acknowledge that latency varies across datasets/devices, the reported gap between the two datasets is unexpected. We hope the authors can provide a reasonable explanation for this apparent inconsistency.
• Additionally, LION-Mamba is faster than HEDNet on Argoverse V2. What is the experimental setting of LION-Mamba on AV2? I would like to clarify whether LION-Mamba and HEDNet were compared fairly, particularly regarding the sparse detection head and BEV backbone.

Comment

Thanks for your comments. We further clarify them below, as there appear to be some misunderstandings.

Contribution: Extending the group size in LION is not a straightforward application of existing techniques. On the contrary, effectively applying linear group RNNs to 3D object detection in highly sparse point clouds is non-trivial due to their limitation in spatial modeling. To tackle this problem, we introduce a simple 3D spatial feature descriptor and integrate it with the linear group RNN operators to enhance their spatial features, rather than blindly increasing the number of scanning orders for voxel features. To further address the challenge of highly sparse point clouds, we propose a 3D voxel generation strategy to densify foreground features, exploiting the natural auto-regressive property of linear group RNNs. For more details and in-depth analysis, please refer to the method section of the main paper.

Effectiveness: We do not think this comparison undermines our claims. First, our main contributions are the 3D spatial feature descriptor and voxel generation, whose superiority within our linear RNN-based framework LION has been demonstrated in the ablation studies (Tables 3, 4, and 5) of the main paper and in the experiment in Question 3 of the authors' rebuttal. Besides, to illustrate the effectiveness of LION with a large group size, that experiment compares LION-DSVT (keeping the same small group size as the original paper, due to the enormous computation of self-attention) and LION-Mamba (large group size) under the same structure. LION-Mamba not only outperforms LION-DSVT by 1.2 mAPH (L2) but also has lower latency, which verifies the effectiveness of LION in extending the group size.

The latency of HEDNet on nuScenes: We use the original config of HEDNet for nuScenes, which is voxel-based; the newer config in the HEDNet codebase is pillar-based. Here we also report the latency of pillar-based HEDNet, which is about 1.42 times its Waymo latency. However, the HEDNet paper adopts the voxel-based variant for nuScenes (please refer to Section 4.2 of the HEDNet paper), so we report the voxel-based results.

| Method          | Latency (ms) |
|-----------------|--------------|
| HEDNet (pillar) | 106.4        |
| HEDNet (voxel)  | 162.5        |
| LION-Mamba      | 183.8        |

The latency of LION on Argoverse V2: First, we clarify that LION is different from HEDNet. To address your concerns, we provide a detailed comparison by reporting the latency of each component. Notably, we use 3D and 2D backbones composed of LION layers, which should be compared against the 3D and 2D backbones proposed by HEDNet (the HEDNet 3D backbone and CascadeDEDBackbone). For the detection head, we employ the VoxelNeXt head. Overall, LION outperforms HEDNet by a large margin (4.4 mAP) while keeping comparable latency. We believe these results demonstrate the effectiveness of LION, and we will add the more detailed results in the revised version.

| Method     | Voxel Extraction (ms) | 3D Backbone (ms) | 2D Backbone (ms) | Detection Head (ms) | Total (ms) |
|------------|-----------------------|------------------|------------------|---------------------|------------|
| HEDNet     | 3.0                   | 40.4             | 85.0             | 63.9                | 192.3      |
| LION-Mamba | 4.7                   | 109.7            | 23.3             | 48.9                | 186.6      |

Finally, despite some misunderstandings about our paper, we still greatly appreciate your time and valuable suggestions!

Official Review (Rating: 6)

This paper presents the LION block, a neural component that builds a backbone extracting 3D features with linear group RNNs for 3D object detection. The authors introduce a 3D spatial feature descriptor to extract point features, and a novel auto-regressive voxel generation method to densify foreground features in sparse point clouds.

Strengths

  1. The presentation is clear and easy to follow.
  2. The proposed method surpasses previous state-of-the-art methods.

Weaknesses

  1. Figure 3 (a) is confusing. Since the 3D Spatial Feature Descriptors are neural layers with learnable parameters, it would be better to represent these layers with blocks like “LION Layer”.
  2. It would be beneficial to discuss the rationale for positioning the 3D Spatial Feature Descriptor after the Linear Group RNN, as placing it beforehand could potentially better preserve spatial information.
  3. An ablation study on Voxel Merging and Expanding would help quantify their contributions to the framework's performance.

Questions

Since objects of interest in autonomous driving are relatively small compared to the whole scene, why is the long-range relation (K = 4096, 2048, ...) important (lines 133-135) in this scenario?

Limitations

The limitations are properly discussed in the main paper.

Author Response

W1: Figure 3 (a) is confusing. Since the 3D Spatial Feature Descriptors are neural layers with learnable parameters, it would be better to represent these layers with blocks like “LION Layer”.

Thanks for your nice suggestion! We will rename "3D Spatial Feature Descriptor" to "3D Spatial Description Layer" to make it clearer in the final version.

W2: It would be beneficial to discuss the rationale for positioning the 3D Spatial Feature Descriptor (3D SFD) after the Linear Group RNN, as placing it beforehand could potentially better preserve spatial information.

Thanks! We provide the results for different placements in the following table. For Placement 1, we place the 3D SFD after voxel merging; for Placement 2, we place it before voxel merging. We agree with your explanation that placing the 3D spatial feature descriptor beforehand can better preserve spatial information. We will add this discussion of the rationale for positioning the 3D spatial feature descriptor in the revised version.

| Method      | Vehicle   | Pedestrian | Cyclist   | mAP/mAPH (L2) |
|-------------|-----------|------------|-----------|---------------|
| Baseline    | 66.4/66.0 | 73.5/67.4  | 70.4/69.3 | 70.1/67.6     |
| Placement 1 | 66.5/66.1 | 74.8/69.1  | 71.1/70.2 | 70.1/68.6     |
| Placement 2 | 67.0/66.6 | 75.4/70.2  | 71.9/71.0 | 71.4/69.3     |

W3: An ablation study on Voxel Merging and Expanding would help quantify their contributions to the framework's performance.

Thanks! Note that removing the voxel merging and expanding operations means the input voxels of the linear group RNN are not processed by any downsampling or upsampling, which incurs additional computational cost. In the following table, we provide the experiment with and without the voxel merging and expanding operations. Adopting them effectively reduces the computational cost while maintaining superior performance.

| Method                                       | Latency (ms) | Vehicle   | Pedestrian | Cyclist   | mAP/mAPH (L2) |
|----------------------------------------------|--------------|-----------|------------|-----------|---------------|
| LION-Mamba (w/o voxel merging and expanding) | 180.1        | 66.8/66.4 | 75.1/70.1  | 72.1/71.1 | 71.3/69.2     |
| LION-Mamba                                   | 146.2        | 67.0/66.6 | 75.4/70.2  | 71.9/71.0 | 71.4/69.3     |

Q1: Since objects of interest in autonomous driving are relatively small compared to the whole scene, why is the long-range relation (K = 4096, 2048, ...) important (lines 133-135) in this scenario?

Good question! Capturing the long-range relationship helps obtain richer context for better understanding the whole scene, so it matters not only for detecting large objects but also for small ones.
In this paper, we build the long-range relationship via linear group RNN operations. We provide ablation studies of different group sizes in the following table, with a minimum group size of 256 for all four LION blocks (baseline I: [256, 256, 256, 256]). The settings with larger group sizes (II, III, IV, V) bring consistent performance improvements over the baseline (I). However, performance drops from IV to V when the group size is enlarged further, which might be due to less effective retention of important information in excessively long sequences, owing to the limited memory capacity of linear RNNs. Finally, to better illustrate the long-range relationship, we provide a visualization in Figure 1 of our uploaded PDF file. We will add these discussions and experiments in the final version.

| #   | Group Size                     | Vehicle   | Pedestrian | Cyclist   | mAP/mAPH (L2) |
|-----|--------------------------------|-----------|------------|-----------|---------------|
| I   | [256, 256, 256, 256]           | 65.6/65.2 | 72.3/65.0  | 68.3/67.2 | 68.8/65.8     |
| II  | [1024, 512, 256, 256]          | 66.9/66.5 | 74.9/69.6  | 70.8/69.8 | 70.9/68.6     |
| III | [2048, 1024, 512, 256]         | 66.7/66.3 | 74.9/69.7  | 72.2/71.2 | 71.3/69.1     |
| IV  | [4096, 2048, 1024, 512] (Ours) | 67.0/66.6 | 75.4/70.2  | 71.9/71.0 | 71.4/69.3     |
| V   | [8192, 4096, 2048, 1024]       | 66.5/66.1 | 74.6/69.5  | 71.6/70.6 | 70.9/68.7     |
Author Response

We are grateful for the valuable suggestions and feedback of all reviewers, which will greatly improve the quality of our paper. We will carefully revise our paper according to your suggestions.

To Reviewers ehGN and wvps: We provide the visualization of long-range dependencies in the uploaded PDF file.

To Reviewer 5dKF: We provide the anonymous screenshot of the test results on the Waymo official website in the uploaded PDF file.

Comment

Dear AC and Reviewers,

Thank you for taking the time to review our manuscript. We are pleased that the reviewers have recognized the strengths of our work. Specifically, all reviewers (ehGN, wvps, 5dKF, and 9fJS) appreciate that LION achieves superior performance on Waymo and nuScenes. In addition, Reviewer wvps considers the problem studied in this paper important, since the long-range relation is critical in point cloud detection. Reviewer 5dKF finds the SOTA performance of LION, achieved with linear group RNNs, very interesting. Reviewer 9fJS notes that LION is an early attempt at linear group RNNs for 3D detection. Finally, both Reviewers 5dKF and 9fJS think that our method proves the generality of LION by supporting different linear operators (RetNet, RWKV, Mamba).

However, some reviewers also have some concerns:

  1. Reviewer ehGN has concerns about why the long-range relation is important.
  2. Reviewer wvps has concerns about novelty.
  3. Reviewer 5dKF asks us to report the results of LION with multiple frames on the Waymo validation and test sets.
  4. Reviewer 9fJS thinks that the ablation for larger group sizes should be provided, since it is a main claim.

We have provided detailed responses to these concerns. Thank you again for your time and consideration!

Best regards,

All authors

Final Decision

The paper introduces a linear group RNN-based backbone for 3D object detection, offering the advantage of a larger window size over previous Transformer-based methods. It initially received three positive scores and one negative score, with the major concerns including the importance of long-range relations, the novelty of the approach, the need for multi-frame results on the Waymo dataset, and the necessity of ablation studies for larger group sizes. After the authors' careful response and multiple rounds of discussion, most concerns were addressed satisfactorily, and most reviewers reached a consensus acknowledging the paper's value to the community.

Therefore, the meta-reviewer believes the work meets the conference's standards and recommends it for acceptance.