6.3

/10

Poster4 位审稿人

最低6最高7标准差0.4

3.8

置信度

正确性3.0

贡献度3.0

表达3.0

NeurIPS 2024

RETR: Multi-View Radar Detection Transformer for Indoor Perception

Ryoma Yataka,Adriano Cardace,Pu Perry Wang,Petros Boufounos,Ryuhei Takahashi

OpenReview PDF

提交: 2024-05-16更新: 2025-01-17

TL;DR

We propose RETR for multi-view radar indoor perception, improving object detection and segmentation, outperforming current methods on indoor radar datasets.

摘要

关键词

indoor monitoringradar perceptionobject detectioninstance segmentationdetection transformermulti-view radar heatmaps

评审与讨论

审稿意见

评分: 6置信度: 42024-06-16

This work introduces a novel method RETR for indoor object detection and segmentation based on multi-view radar heatmaps. RETR extends the popular DETR framework and incorporates modifications specific to multi-view radar perception, such as depth-prioritized feature similarity via TPE, a trip-plane loss, and a learnable radar-to-camera transformation. Experiments on two datasets demonstrate the effectiveness of RETR for object detection and segmentation.

优点

This work utilizes multi-view radar heatmaps as input for indoor human perception, which has broad application for privacy-aware indoor sensing and monitoring.
This work extends image-based DETR to multi-view radar-based RETR, establishing a new baseline for radar-based human detection and segmentation. The proposal of tunable positional embedding is interesting. It can enable the adjustment of different axes for positional embeddings.
The experiments are comprehensive with two tasks validated on two tasks. The details of hyperparameters and implementation details are adequate, making this paper a good reference for future works.

缺点

Most commercial radars have 2D virtual antenna arrays, e.g., 16*8, instead of only a pair of 1D antenna arrays. Is there any specific reason for considering such kinds of radar antenna arrays? Normally, the 2D heatmaps (range-elevation, range-azimuth) are generated via projecting the 3D radar cube data in two views. The explanation in Section 2 does not fit the real process of multi-view radar heatmap generation. If this work considers perception with such a unique radar antenna array, better to introduce what types of radar the dataset used.
In the experiments, only radar heatmap-based methods are used for comparison. Better to incorporate the results of radar point cloud-based and camera-based approaches for a more complete comparison. The survey of related work is also not complete, ignoring recent works in radar-based object detection for autonomous driving [1-3], which also use radar heatmaps: [1] Liu, Yang, et al. "Echoes beyond points: Unleashing the power of raw radar data in multi-modality fusion." Advances in Neural Information Processing Systems 36 (2024). [2] Paek, Dong-Hee, Seung-Hyun Kong, and Kevin Tirta Wijaya. "K-radar: 4d radar object detection for autonomous driving in various weather conditions." Advances in Neural Information Processing Systems 35 (2022): 3819-3829. [3] Skog, Mikael, et al. "Human Detection from 4D Radar Data in Low-Visibility Field Conditions." arXiv preprint arXiv:2404.05307 (2024).

问题

At line 28, it is better to explain why radar heatmaps are more preferred for challenging perception tasks. At line 31, it's not necessary to mention if one work is publicly accessible for the introduction section.
The relationship and different between previous works in radar-based detection and segmentation, e.g., RFMask and RETR is not well explained.
Better to show the tokenization step in Figure 3 to aid in understanding the pipeline.
In line 135, equation 3 should be Equation 3.
How did the author implement the top-K feature selection step? And how to supervise such a selection step to ensure the selected features are significant for the detection task?
In line 229, what's the reason why the author claimed that the calibration may be only accurate for a limited interval of depth and angles.
The MMVR dataset can not be found online.
Please provide captions for table 1 and 2 to increase the readability.

局限性

NA.

作者回复

2024-08-07

Thank you for the detailed comments and valuable feedback! We provide our point-to-point responses below. Due to space constraints, we have shortened your comments for brevity.

Most commercial radars have 2D virtual antenna arrays, e.g., 16*8, instead of only a pair of 1D antenna arrays. Is there any specific reason for considering such kinds of radar antenna arrays?

Thanks for raising this point. The main reason for using a configuration of two cascading radars in horizontal and vertical orientations is to harness finer angular resolution in both azimuth and elevation domains to support all three perception tasks, especially for pixel-level segmentation. Both HIBER and MMVR datasets used this configuration.

Most commercial radars have a typical configuration of $4$ Rx and $3$ Tx antennas, yielding a virtual array of $8$ elements in one angular dimension and $4$ in the other. Examples include TI's IWR1443 and IWR1843 chipsets and, more recently, NXP's TEF81xx and TEF82xx series. This usually leads to an angular resolution of $15^\circ$ in one angular dimension and about $30^\circ$ in the other angular dimension.

The configuration of two cascading radars ( $12$ Tx and $16$ Rx) yields a virtual array of $86$ non-overlapping half-wavelength-spaced elements in both vertical and horizontal dimensions, offering an angular resolution of $1.3^\circ$ , more than $10 \times$ better. The resulting high-resolution multi-view radar heatmaps can provide fine-grained radar features and support not only BBox estimation and pose estimation but also the more challenging pixel-level segmentation.

The explanation in Section 2 does not fit the real process of multi-view radar heatmap generation. Better to introduce what types of radar the dataset used.

Due to space constraints, the current version of Section 3 - Generation of Radar Heatmaps presents an abbreviated description of the full process by skipping steps such as MIMO waveform separation for virtual array processing, integration over the Doppler domain, and projection onto the azimuth and elevation domains. In the updated paper, we plan to include a new section in the Appendix to introduce the dual (horizontal-vertical) radar configuration and provide a detailed explanation of the generation of the two radar heatmaps.

In the experiments, only radar heatmap-based methods are used for comparison. Better to incorporate the results of radar point cloud-based and camera-based approaches for a more complete comparison.

Thanks for the suggestion. We have two response points:

Dataset: To the best of our knowledge (see Table 4 in the PDF of the global response), we cannot find an indoor radar dataset with both radar heatmap and point cloud formats for evaluating a given method (either RETR or a baseline).
Baseline: We have included DETR, a camera-based method, in our baseline comparisons. We will make an effort to incorporate a radar cloud-based baseline in the updated paper.

The survey of related work is also not complete, ignoring recent works in radar-based object detection for autonomous driving [1-3], which also use radar heatmaps: [1] Liu, Yang, et al... [2] Paek, Dong-Hee, ... [3] Skog, Mikael....

Thanks for pointing out these recent radar datasets featuring radar heatmaps. We will include them in the updated paper. Specifically, we plan to discuss [3] in Section 2 - Related Work, as it is particularly relevant to our task of human perception using radar heatmaps.

Line 28: better to explain why radar heatmaps are more preferred for challenging perception tasks.

In the updated paper, we plan to highlight the difference between radar heatmaps and point clouds. We will also explain how high-resolution, multi-view radar heatmaps may be better suited for supporting more challenging perception tasks, such as pixel-level segmentation.

The relationship and difference between previous works in radar-based detection and segmentation...is not well explained.

In Fig. 2 of the main paper, we intended to highlight the major differences between RFMask and the proposed RETR. RFMask uses regional proposals and features from only the horizontal radar view with a fixed height. RETR, on the other hand, employs a detection transformer, fuses multi-view radar features, and exploits the unique multi-view radar setting via TPE and radar-to-camera coordinate transformation. We will expand the RFMask with Refined BBoxes section in the Appendix to clarify these differences further.

Better to show the tokenization step in Figure 3....

see Fig. 2 of the PDF in the global response.

implement the top-K feature selection? And how to supervise...?

For the Top- $K$ selector, we just select feature tokens with the highest norms computed over the channel dimension. We observed that direct supervision is unnecessary for this step because radar features tend to be extremely localized. Please refer to Fig. 6 of the main paper for visualizing the cross-attention between selected features and detected BBoxes.

Line 229, why the author claimed that the calibration may be only accurate for a limited interval of depth and angles.

This is mainly due to the varying cross-range radar resolution over depth and angles as the radar operates in a polar coordinate system. For a given angular resolution, the cross-range cell resolution at $3$ meters will be about $3\times$ larger than that at $1$ meter. To compensate for such resolution differences, one may need to repeat the calibration for different depth/angular intervals.

The MMVR dataset can not be found online.

We reached out to the MMVR authors who provided us with the "P2" split at the time of submission. The MMVR dataset should be available now by searching "MMVR Dataset", although we are not allowed to share any links in the rebuttal.

Minor questions: Line 31:... Line 135: .....provide captions for table 1 and 2

We will make the changes accordingly.

2024-08-12

Thank the authors for providing such detailed response to my comments. I am looking forward to see more explanantion regarding the radar configuration and generation of the radar heatmaps in the revised version. Regarding perception methods based on radar heatmap, we find two more recent works: [1] Kong, Seung-Hyun, Dong-Hee Paek, and Sangyeong Lee. "RTNH+: Enhanced 4D Radar Object Detection Network using Two-Level Preprocessing and Vertical Encoding." IEEE Transactions on Intelligent Vehicles (2024). [2] Ding, Fangqiang, Xiangyu Wen, Yunzhou Zhu, Yiming Li, and Chris Xiaoxuan Lu. "RadarOcc: Robust 3D Occupancy Prediction with 4D Imaging Radar." arXiv preprint arXiv:2405.14014 (2024). Hope the citation of them could improve the rigorousness of your related work.

Overall, thank you for your efforts in addressing my concerns, and I have decided to adjust my recommendation to a score of 6.

2024-08-12

We sincerely appreciate the reviewer's time and effort in reviewing our rebuttal, and we're delighted to see that it has been positively received. We will certainly consider the suggested references to update our related work. Additionally, we kindly encourage the reviewer to update the score to reflect the stated intentions.

审稿意见

评分: 6置信度: 22024-07-12

The primary content of this paper is an introduction to a multi-view radar detection transformer algorithm (RETR) for indoor perception. The algorithm achieves effective object detection in indoor environments by utilizing multi-view radar data and combining self-attention mechanisms and cross-attention mechanisms. The author validates the performance improvement of the algorithm through experiments and discusses its application in indoor perception.

优点

1.The idea is helpful for the field of view indoor multi-view radar perception. 2.The paper is clear and easy to follow. 3.Rigorous ablation studies were conducted, providing evidence of the proposed method's efficacy. 4.RETR, based on the original DETR, has achieved significant performance improvements on indoor multi-view radar datasets with heatmap input.

缺点

1.Radar perception datasets primarily utilize point clouds as the data form. Methods based on heatmaps for indoor multi-view radar perception are not yet sufficient. The author mainly compares with customized DETR and RFMask, but the comparison is insufficient. Have the authors considered conducting further comparison experiments on the HUPR [1] dataset? If the authors can further demonstrate the method's generalization ability, I would consider giving a higher score. 2.The method presented is an improvement on the DETR and combines existing methods in a novel way. While innovative, it's not highly original for NeurIPS. 3.Research on indoor multi-view radar perception using heatmaps is relatively scarce. The author is unable to provide source code, potentially limiting contributions to this field.

[1] Lee, Shih-Po, et al. "Hupr: A benchmark for human pose estimation using millimeter wave radar." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023.

问题

1.Could you provide a detailed description of the design of the Top-k selector? 2.Can you provide a more detailed analysis of the impact of the proposed RETR on real-time performance?

局限性

作者回复

2024-08-07

Thank you for the detailed comments and valuable feedback! We provide our point-to-point responses below.

Radar perception datasets primarily utilize point clouds as the data form. Methods based on heatmaps for indoor multi-view radar perception are not yet sufficient. The author mainly compares with customized DETR and RFMask, but the comparison is insufficient. Have the authors considered conducting further comparison experiments on the HUPR [1] dataset? If the authors can further demonstrate the method's generalization ability, I would consider giving a higher score.

Borrowing Table 1 from the MMVR paper, also listed as Table 4 in the PDF of the global response, you are correct that indoor radar perception datasets primarily use point clouds. However, an increasing number of datasets employ multi-view radar heatmaps to support more diverse perception tasks: RF-Pose, HuPR, HIBER, and MMVR, with the latter three collected since 2022. RF-Pose and HuPR have a resolution of $15^\circ$ in the two-view radar heatmaps, while HIBER and MMVR offer a finer resolution of $1.3^\circ$ .

Thanks for pointing out the HuPR dataset. First, HuPR focuses on pose estimation (keypoints), while our work targets object detection (BBox estimation) and segmentation (pixel-level masks). To utilize the HuPR dataset for training and evaluating our RETR pipeline, we would need to extract bounding box and segmentation labels from each HuPR frame which, unfortunately, cannot be completed within the rebuttal period.

Second, even with BBox and mask labels from HuPR, our RETR pipeline requires geometric information about the radar and camera coordinate systems, including the radar-to-camera transformation (rotation matrix and translation vector) and the 3D-to-2D camera projection matrix (pinhole camera model). Both HIBER and MMVR datasets provide this calibrated information. While our learnable radar-to-camera transformation can reduce some geometry dependency, the 3D-to-2D projection matrix is still necessary. We have reached out to the HuPR authors for this additional geometric information. Once we receive it, we will report the RETR performance on at least BBox estimation.

Third, we believe that the proposed RETR pipeline and its evaluation over two separate datasets (HIBER and MMVR) have sufficiently demonstrated the generalization capability. It is noted that the radar-to-camera geometries in these datasets are completely different but known (via calibration or learning). RETR is purposely designed to handle these differences by incorporating calibrated/learnable radar-to-camera transformations, 3D-to-2D projections, and tri-plane loss functions.

The method presented is an improvement on the DETR and combines existing methods in a novel way. While innovative, it's not highly original for NeurIPS.

We believe our contributions to be novel and original as they are not a simple combination of existing methods. Indeed it is true that we build upon DETR, yet we propose several contributions that make our RETR unique. First, we highlight how we adapt DETR to the multi-view scenario thanks to the self-attention mechanism combined with the Top-K selection and by reusing the cross-attention mechanism to avoid traditional object-to-multi-view-feature association.

Second, we introduce a depth-prioritized feature similarity via a tunable positional embedding (TPE), incorporating a crucial inductive bias of shared depth across the two radar views to enhance downstream tasks.

Third, we propose a tri-plane loss from both radar and camera coordinate systems which, to the best of our knowledge, has never been considered for object detection from radar heatmaps.

Research on indoor multi-view radar perception using heatmaps is relatively scarce. The author is unable to provide source code, potentially limiting contributions to this field.

We plan to release our code after the paper acceptance.

Question 1: Could you provide a detailed description of the design of the Top-k selector?

Regarding the Top- $K$ selection, we begin with the features map extracted from the shared backbone. Each cell in the feature map can potentially be considered as a patch/token for input into the subsequent transformer encoder. To alleviate the time complexity of the attention module, we select only the tokens with the highest norms computed over the channel dimension. In this way, we propagate only the most relevant information to the following modules and increase the inference performance, while keeping the complexity low.

In Fig. 6 of the main paper, we visualize selected top features (and their locations in the radar heatmap) in the two radar views that are used to detect bounding boxes in the image plane by inspecting the cross-attention module.

In Table 1 of the PDF in the global response, we have included an additional ablation study examining the impact of $K$ on detection performance.

Question 2: Can you provide a more detailed analysis of the impact of the proposed RETR on real-time performance?

We report the inference time in Table 3 of the PDF in the global response. We used an NVIDIA A40 GPU for evaluation. RETR has a comparable inference time as RFMask ( $20.89$ ms against $23.75$ ms of RETR). We compute the average inference time across all radar frames in the test data.

2024-08-12

Thank you for your response and the detailed clarification of my comments. The explanation regarding the use of the HuPR dataset and the challenges faced in obtaining labels such as Bounding Boxes is indeed insightful. This complexity could potentially limit the scope of your current evaluation, especially during the rebuttal period. Your explanation of the Top-K selection mechanism and the related experiments on inference time further strengthen the argument for the effectiveness of RETR.

Overall, I believe your research contributions are significant and closely related to the field. The decision to open-source the code will have a positive impact on the research community for radar perception tasks. Thank you for your efforts in addressing my concerns, and I have decided to adjust my recommendation to a score of 6.

2024-08-13

审稿意见

评分: 7置信度: 52024-07-13

In this paper, the authors propose Radar dEtection TRansformer (RETR), an extension of the popular DETR architecture, tailored for multi-view radar perception. RETR inherits the advantages of DETR, eliminating the need for hand-crafted components for object detection and segmentation in the image plane.

优点

Two radars with different directions are deployed. Transformer is applied for segmentation.

缺点

For indoor perception with radar, multi-path is expected. It would be good to add a paragraph to discuss this issue and how to mitigate multi-path or its impact to the perception/segmentation performance.

The motivation of applying transformer for indoor segmentation is not well discussed. What are the benefits and unique challenges of applying transformer to radar based indoor segmentation?

For the introduction of Generation of Radar Heatmaps, please consider cite the following paper: S. Sun, A. P. Petropulu and H. V. Poor, "MIMO Radar for Advanced Driver-Assistance Systems and Autonomous Driving: Advantages and Challenges," in IEEE Signal Processing Magazine, vol. 37, no. 4, pp. 98-117, July 2020.

问题

Is there any interference (cross-path signals) between the two radars deployed in vertical and horizontal directions?

局限性

Usually a large amount of high quality radar data is required to train the transformer. It is highly recommended to carry out validation with different amount of training data.

作者回复

2024-08-07

Thank you for the detailed comments and valuable feedback! We provide our point-to-point responses below.

Thanks for the insightful comment. We agree that multi-path reflections from the ground, ceiling, and other strong scatterers (e.g., metal) can cause (first-order or second-order) ghost targets and elevate the noise floor. One way to address this issue is to employ classical signal processing techniques into the radar heatmap generation to remove these ghost targets and the static background reflection.

We may address this issue directly in the end-to-end radar perception pipeline by labeling these ghost targets in standard radar heatmaps (although it is difficult and costly to have these labels) and directly classifying RETR object queries to one of { $\{\emptyset, person, ghost\}$ }, alongside regressing queries to the bounding box parameters. As you suggested, we will add a paragraph to discuss the multi-path issue and potential ways to mitigate its impact.

The motivation of applying transformer for indoor segmentation is not well discussed. What are the benefits and unique challenges of applying transformer to radar based indoor segmentation?

We agreed that in the main paper, we primarily used the object detection example to motivate the RETR, while the segmentation part was deferred to Appendix B Segmentation. In the updated paper, we will emphasize that the segmentation is an integral part of the RETR by highlighting the following points in the main paper:

The segmentation head uses the estimated bounding box (BBox) as a prior or constraint to classify each pixel within the BBox (see Fig. 8 Illustration of Segmentation Head in Appendix).
The pretrained RETR components, such as the backbone, top- $K$ selection, and the detection transformer are reused with frozen weights to train the segmentation head (the lower branch) in Fig. 3 of the main paper.

We will point out the challenges in extracting finer-grained radar features, fusing features from two radar views, and utilizing the prior BBox from the detection head to support the pixel-level segmentation task. We will emphasize how to address these challenges by leveraging the DETR architecture to avoid cross-view radar feature association and introducing additional modifications, such as tunable positional embedding, radar-to-camera coordinate transformation, and tri-plane loss.

For the introduction of Generation of Radar Heatmaps, please consider citing the following paper: S. Sun, A. P. Petropulu and H. V. Poor, "MIMO Radar for Advanced Driver-Assistance Systems and Autonomous Driving: Advantages and Challenges," in IEEE Signal Processing Magazine, vol. 37, no. 4, pp. 98-117, July 2020.

Thank you for the suggestion. We will add the suggested paper to the reference list.

Is there any interference (cross-path signals) between the two radars deployed in vertical and horizontal directions?

Thanks for raising this important point. To prevent cross-radar interference between the horizontal and vertical radar sensors, the two radars were configured to operate at different frequency bands. In the HIBER dataset, the horizontal radar operates in the $77 - 78.23$ GHz band, while the vertical radar is in the band of $79-80.23$ GHz. In the MMVR dataset, the horizontal radar operates in the $77-78.36$ GHz band, while the vertical radar is in the band of $79-80.36$ GHz. In both cases, there is a minimum gap of $500$ MHz between the operating frequency bands of the two radars.

Usually a large amount of high quality radar data is required to train the transformer. It is highly recommended to carry out validation with different amount of training data.

Good point. In Table 2 of the PDF in the global response, we report the impact of training data size on detection performance using the MMVR dataset. We compare the original data size (x1.0) with $190,441$ radar frames against reduced data sizes of half (x0.5) and one-tenth (x0.1). The result shows a gradual improvement in detection performance with an increase in data size, particularly at higher IoU thresholds, such as AP $_{75}$ .

审稿意见

评分: 6置信度: 42024-07-16

The paper introduces Multi-View Radar Detection Transformer for indoor object detection. Inspired from DETR, the authors propose and end-to-end RETR to detect objects from the radar inputs. To improve the feature association across the two radar views, the authors introduce a new Tunable Positional Embedding. The proposed approach achieves the solid performance on the standard benchmarks of radar-based indoor object detection.

优点

The paper is well-written and easy to follow.
The proposed approach is simple yet well-motivated. The proposed RETR eliminates the cumbersome design of the detection network for radar inputs.
The analysis of Tunable Positional Embedding Section 4.3 is comprehensive and well-motivated.
The proposed method achieves strong experimental results on MMVR and HIBER datasets.

缺点

I am wondering how the Top-K Feature selection impacts the performance of the network. How can we choose the number of K? It seems that there is no experiment to validate the choice of K. It will be better if the authors conduct an ablation study to explore the effectiveness of choosing K.
What is the computation cost of the proposed RETR model? It will be better if the authors report the inference time/computational cost of the proposed model
For the learnable radar-to-camera coordinate transformation, while I acknowledge the performance improvement according to this proposed model, I have several questions related to this module.
- Does the entire dataset use the same set of learnable vector $\omega$ and translation vector $t$ ? Or each radar sample in the data will have a different vector $\omega$ and translation vector $t$ ?
- How can we verify that the learnable radar-to-camera coordinate transformation is accurate? Is there any way to evaluate it using the ground truths?

问题

Please refer to my weakness section.

局限性

The authors have discussed the limitations and broader impact in the paper.

作者回复

2024-08-06

Thank you for the detailed comments and valuable feedback! We provide our point-to-point responses below.

I am wondering how the Top-K Feature selection impacts the performance of the network. How can we choose the number of K? It seems that there is no experiment to validate the choice of K. It will be better if the authors conduct an ablation study to explore the effectiveness of choosing K.

We appreciate Reviewer DJHb's suggestion. In response, we have conducted additional experiments to examine the impact of $K$ on detection performance. The results, detailed in Table 1 of the PDF in our global response, indicate that increasing $K$ improves object detection performance. However, as noted in our detailed response below to your next comment on the computation cost and inference time, choosing a larger $K$ also significantly increases the training and inference time.

What is the computation cost of the proposed RETR model? It will be better if the authors report the inference time/computational cost of the proposed model.

Thanks for pointing out the computation cost and inference time.

First, following the computational complexity notation used in the DETR paper, every self-attention mechanism in the encoder has a complexity of $\mathcal{O}(d^2 2K + d (2K)^2)$ where $d$ is the embedding dimension and $K$ is the number of selected features from the Top- $K$ selection. The cost of computing a single query/key/value embedding is $\mathcal{O}(d' d)$ (with $d=Md'$ where $M$ denotes the number of attention heads and $d'$ the dimension in each head), while the cost of computing the attention weights for one head is $\mathcal{O}(d' (2K)^2)$ . Other computations may be negligible. In the decoder, each self-attention mechanism has a complexity of $\mathcal{O}(d^2 N + d N^2)$ where $N$ is the number of queries, and the cross-attention between query and multi-view radar features has a complexity of $\mathcal{O}(d^2(N + 2K) + d 2NK)$ . In conclusion, the overall complexity of our RETR model is $\mathcal{O}(4d^2 K + 4d K^2 + 2d^2 N + d N^2 + 2d NK)$ .

Second, regarding the inference time, we report the average inference time in milliseconds in Table 3 of the PDF in the global response. We used an NVIDIA A40 GPU to evaluate the inference time over all frames in the test data. RETR achieved an average inference time of $23.75$ ms that is comparable to $20.89$ ms of RFMask.

For the learnable radar-to-camera coordinate transformation, while I acknowledge the performance improvement according to this proposed model, I have several questions related to this module. Does the entire dataset use the same set of learnable vector and translation vector ? Or each radar sample in the data will have a different vector and translation vector ?

The learned vectors $\omega$ and $t$ (or, equivalently, the rotation matrix $R$ and translation vector $t$ ) are fixed and applied consistently to all test frames.

During training, $\omega$ and $t$ were updated from one minibatch to the next, as they are part of the learnable parameters in RETR.

How can we verify that the learnable radar-to-camera coordinate transformation is accurate? Is there any way to evaluate it using the ground truths?

Good question. One way to verify the coordinate transformation learning is, for a given point in the 3D radar coordinate, by checking the distance between the two transformed points using calibrated and learned coordinate transformation. Although the calibrated coordinate transformation (including both rotation matrix and translation vector) is NOT ground truth due to radar resolution (also see our responses to Reviewer XrXa), it serves as a reasonable baseline or reference benchmark for the learned coordinate transformation.

In the PDF of the global response, Fig.1 shows the distance difference between the two transformed points over the training steps using the MMVR dataset. We randomly initialize the learned rotation matrix and translation vector at iteration 1. The results demonstrate that, as training progresses, the learned radar-to-camera coordinate transformation becomes increasingly aligned with the calibrated one, indicating that the learning is moving in the correct direction.

Finally, we'd like to add that by using the learnable radar-to-camera coordinate transformation, it is possible to incorporate the radar-to-camera geometry into the end-to-end radar perception pipeline without the need for a cumbersome calibration step, while still achieving comparable perception performance.

评论- Feedback to Author Rebuttal

2024-08-12

Thank the authors for the good rebuttal. It has addressed my concerns. I hope that you can include these answers and experiments in your revised version. Therefore, I decided to increase my score to 6.

2024-08-13

作者回复

2024-08-07

We thank all reviewers for their insightful comments, suggestions and questions. In the rebuttal form to each reviewer, we provide detailed point-to-point responses.

In these point-to-point responses, we often refer to the attached PDF for additional results in the form of tables and figures.

最终决定Accept (poster)

2024-09-25

The rebuttal provided clarifications about the proposed method and its analysis that were useful for assessing the paper's contribution and responded adequately to most reviewer concerns. All reviewers recommend acceptance after discussion (with three weak accepts and one accept), and the ACs concur. The final version should include all reviewer comments, suggestions, and additional clarifications from the rebuttal.