PaperHub
Rating: 5.5 / 10 · Poster · 4 reviewers
Scores: 5, 5, 7, 5 (min 5, max 7, std 0.9)
Confidence: 3.0 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.0
NeurIPS 2024

Real-time Stereo-based 3D Object Detection for Streaming Perception

Submitted: 2024-05-14 · Updated: 2025-01-13

Abstract

Keywords
3D object detection; Streaming perception; Real-time perception

Reviews and Discussion

Official Review
Rating: 5

This paper proposes a real-time stereo 3D object detection algorithm under the streaming perception framework. The proposed StreamDSGN builds on existing streaming perception work and the DSGN 3D object detection work and makes three technical contributions: 1) a Feature-Flow-Fusion (FFF) module that predicts future-frame features using a flow map, reducing the misalignment between supervision and the current observation in the streaming perception setting; 2) a Motion Consistency Loss (MCL) that provides explicit supervision based on motion consistency between adjacent frames; 3) a Large Kernel BEV Backbone (LKBB) for capturing long-range dependencies on low-frame-rate 3D object detection datasets. Experiments show that the proposed StreamDSGN method achieves impressive results on the KITTI Tracking dataset.

Strengths

  1. The proposed algorithm for 3D object detection under streaming perception is interesting, and the evaluation setting that considers both accuracy and latency is reasonable for practical applications such as autonomous driving.
  2. The experimental results are impressive and convincing: the end-to-end framework performs significantly better than a straightforward combination of Streamer and DSGN++_{opt}, and the proposed FFF, MCL, and LKBB techniques together achieve a 4.33% increase in 3D streaming average precision over the end-to-end baseline. The source code is provided and the experimental results are reproducible.
  3. The presentation is clear and easy to follow.
  4. The ablation study is clear and the baseline setting is reasonable.

Weaknesses

  1. The setting of 3D detection from stereo is not very common in autonomous driving, and the relevant datasets are limited. It would be good to extend the stereo camera setting to a multi-camera or camera+LiDAR setting and test the generalization capability of the proposed algorithm on these extended settings.
  2. A detailed computational latency analysis of each algorithm module shown in Figure 3 is missing; the overall latency of 91.45 ms alone may not be enough to assess the performance-latency trade-off of each module.

Questions

  1. What is the algorithm's (GPU) memory consumption?

Limitations

The limitations of the proposed algorithm are addressed in Section 5.

Author Response

Response to reviewer D5z5

Weaknesses


W1: The setting of 3D detection from stereo is not very common in autonomous driving, and the relevant datasets are limited. It would be good to extend the stereo camera setting to a multi-camera or camera+LiDAR setting and test the generalization capability of the proposed algorithm on these extended settings.

A1: In theory, our method is indeed applicable to other BEV-based approaches. However, when deployed to other multi-view BEV-based methods, the following challenges may arise:

  • Streaming perception evaluation requires high-frame-rate annotations to obtain accurate results, but these datasets often have low annotation frame rates; e.g., the nuScenes dataset has a 12 Hz image frame rate but only a 2 Hz annotation frequency.

  • Multi-view query-based methods (e.g., DETR3D, PETR) typically do not generate explicit BEV features.

  • In multi-view BEV-based 3D detection, LSS-series methods (e.g., BEVDet, BEVDepth, BEVStereo) have the limitation of generating discrete and sparse BEV representations (as noted in FB-BEV [1]). These sparse BEVs may lead to numerous redundant warping operations in FFF and may cause potential distortions of warped objects. Please compare Figure 2 in [1] with our Figure 8 to observe the difference in BEV features.

  • For methods in the BEVFormer series, such approaches typically have higher latencies (over 400ms). A large latency interval requires a larger spatial search for the FFF, and the motion consistency assumed by the MCL may no longer hold.

  • Multimodal (camera + LiDAR) methods also have relatively large time consumption due to their dual-branch structure, which is not conducive to achieving an end-to-end streaming perception solution.

These challenges will be included in our discussion of limitations. We will also explore applications related to multi-view methods in the future.


W2: A detailed computational latency analysis of each algorithm module shown in Figure 3 is missing; the overall latency of 91.45 ms alone may not be enough to assess the performance-latency trade-off of each module.

A2: For streaming perception, when the inference speed of the detector exceeds the frame rate, there is no need to examine the trade-off between performance and latency. This viewpoint is supported by StreamYOLO (Ref [59]): "With these 'fast enough' detectors, there is no space for accuracy and latency trade-off on streaming perception as the current frame results from the detector are always matched and evaluated by the next frame." The latency of our StreamDSGN meets this requirement.

However, when facing higher frame rates, further acceleration of the model is required to ensure that the detector can still predict the state of the next frame end-to-end. Therefore, we report the latency of each component to facilitate future optimization efforts. The results are shown in the table below.

| Module | Image Feature Extractor | BEV Downsampling | Feature Flow Fusion | Large Kernel BEV Backbone | Detection Head | Post processing | Total |
|---|---|---|---|---|---|---|---|
| Latency (ms) | 73.62 | 0.02 | 7.71 | 5.06 | 1.52 | 4.10 | 92.03 |

We can see that the Image Feature Extractor has the highest latency at 73.62 ms, making it the primary source of delay. For faster real-time optimization, we might consider replacing this component with alternatives like MobileNet.
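To make the "fast enough" criterion concrete, the sketch below (not the authors' code; the module names and latencies come from the table above, and the 100 ms KITTI inter-frame interval from the discussion) simply sums the per-module latencies and compares the total with the frame budget.

```python
# Illustrative only: per-module latencies (ms) from the table above,
# checked against the 10 Hz (100 ms) inter-frame interval.
module_latency_ms = {
    "Image Feature Extractor": 73.62,
    "BEV Downsampling": 0.02,
    "Feature Flow Fusion": 7.71,
    "Large Kernel BEV Backbone": 5.06,
    "Detection Head": 1.52,
    "Post processing": 4.10,
}

FRAME_INTERVAL_MS = 100.0  # 10 Hz camera stream

total = sum(module_latency_ms.values())
print(f"total latency: {total:.2f} ms")             # ~92.03 ms
print(f"fast enough: {total < FRAME_INTERVAL_MS}")  # True -> no accuracy/latency trade-off needed

# The image feature extractor dominates (~80% of the budget), so
# acceleration efforts would target that module first.
share = module_latency_ms["Image Feature Extractor"] / total
print(f"image backbone share: {share:.1%}")
```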


Questions


Q1: How about the algorithm's (GPU) memory consumption?

A1: With a batch size of 1, the memory consumption during the training phase is 11178 MB, while during the inference phase it is only 2860 MB.


Reference

[1] Li Z, Yu Z, Wang W, et al. FB-BEV: BEV representation from forward-backward view transformations. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023: 6919-6928.

Comment

Thanks to the authors for their response. My questions have been answered, and I think it is reasonable to consider the multi-view setup in future work. I will keep my original positive rating.

Official Review
Rating: 5

In this paper, the authors propose a real-time stereo-based 3D object detection framework for streaming perception. Specifically, the authors design feature flow fusion, a motion consistency loss, and a Large Kernel BEV backbone to improve performance. The authors validate the effectiveness of the proposed method and each module through experiments on the KITTI dataset. However, the authors only compare their method with the one baseline that their method builds on.

Strengths

  1. As claimed in the paper, this is the first work designed for 3D object detection streaming perception, which is a good setting that better aligns with real applications.

  2. The authors analyze the challenges in streaming perception and add new modules to an existing framework to improve performance.

  3. Experiments on the KITTI dataset validate the effectiveness of the three modules designed in this paper.

The paper is well structured and the visualization is good.

Weaknesses

The novelty of the proposed method is not significant. The challenges in streaming perception are evident, and the technical novelty of the proposed solution is not very significant to me. According to Table 2, the biggest performance improvement comes from the fusion of the t-1 frame. The three modules do not result in significant performance improvements.

The intuition of MCL needs more elaboration. Why do we need the velocity loss and acceleration loss? The supervision of position already encodes velocity and acceleration.

The latency of the network strongly depends on the hardware. Also, the latency caused by camera exposure and data transportation is not considered. A comparison under different latencies could be added to better illustrate the usefulness of the method under different scenarios.

Questions

There are some grammar errors in the manuscript.

Limitations

The limitations are addressed in the paper, including incorrect feature fusion caused by occlusion and truncation.

Author Response

Response to reviewer Ytfb

Weaknesses


W1: The novelty of the proposed method is not significant. The challenges in streaming perception are evident, and the technical novelty of the proposed solution is not very significant to me. According to Table 2, the biggest performance improvement comes from the fusion of the t-1 frame. The three modules do not result in significant performance improvements.

A1: Previous works proposed the concept of streaming perception; to the best of our knowledge, we are the first to apply streaming perception to 3D object detection. Note that trying to predict future states from the information of a single moment is an ill-posed problem. Therefore, fusing historical information is the basic idea behind streaming perception, and the differences lie in the details of how the prediction is done. In this work, this practice (fusion of the $t-1$ frame) builds the basic framework of our 3D object detection algorithm; in other words, it is the baseline of our method. So it is not surprising that it contributes the biggest performance improvement. The three modules in this manuscript are our further attempts to enhance perception accuracy on top of the baseline. Their contributions are indeed not as significant as fusing the historical information, but the combination of the baseline and these modules allows our method to achieve state-of-the-art performance in 3D streaming perception.


W2: The intuition of MCL needs more elaboration. Why do we need the velocity loss and acceleration loss? The supervision of position already encodes velocity and acceleration.

A2: The basic premise of streaming perception is that the motion trajectory is smooth without abrupt changes in velocity and acceleration, thus we can use the recent past to predict the near future. If this assumption does not hold, fusing historical information is beneficial for predicting the future. Therefore, we introduce MCL to provide additional supervision of constant velocity and acceleration motion.

Regarding the comment that "The supervision of position already encodes velocity and acc": please note that without MCL, the model is supervised for bounding box regression solely by a single $G_{t+1}$, which does not include velocity and acceleration information. By incorporating MCL, the model includes both the basic bounding box regression loss and motion constraints over the nearest historical intervals. This technique can improve convergence during training and reduce localization errors.
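For illustration, one plausible finite-difference form of these consistency terms is sketched below (a minimal sketch using the $G^{pose}$ notation of Section 3.3; the exact losses and weights in the paper may differ):

```latex
% Requires amsmath. Illustrative finite-difference velocity and acceleration
% consistency terms between the prediction P_{t+1} and the matched
% ground-truth poses G_{t-2}, G_{t-1}, G_t (assumed notation).
\begin{align*}
\hat{v}_{t+1} &= P_{t+1}^{pose} - G_{t}^{pose}, &
v_{t}   &= G_{t}^{pose} - G_{t-1}^{pose}, &
v_{t-1} &= G_{t-1}^{pose} - G_{t-2}^{pose}, \\
\mathcal{L}_{\mathrm{vel}} &= \bigl\lVert \hat{v}_{t+1} - v_{t} \bigr\rVert_{1}, &
\mathcal{L}_{\mathrm{acc}} &= \bigl\lVert (\hat{v}_{t+1} - v_{t}) - (v_{t} - v_{t-1}) \bigr\rVert_{1}.
\end{align*}
```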


W3: The latency of the network strongly depends on the hardware. Also, the latency caused by camera exposure and data transportation is not considered. A comparison under different latencies could be added to better illustrate the usefulness of the method under different scenarios.

A3: Our algorithm StreamDSGN has an inference latency of 91.45 ms on an NVIDIA TITAN RTX GPU (16.3 TFLOPS). (We also tested StreamDSGN on an RTX 3090 GPU (35.6 TFLOPS), where the latency was only around 60 ms.)

To quantify the impact of latency, we conducted additional experiments by adding artificial extra delays, which can be seen as the latency caused by camera exposure and data transportation. Specifically, we add random delays (Gaussian distributions with different means and the same variance) to the inference latency of each frame (average about 91.45 ms). The experimental results are shown in the table below.

| Random noise (ms) | Easy | Moderate | Hard |
|---|---|---|---|
| 0 | 77.47 | 63.76 | 57.42 |
| $X\sim\mathcal{N}(0, 2)$ | 75.86 | 62.35 | 57.17 |
| $X\sim\mathcal{N}(5, 2)$ | 67.97 | 57.25 | 52.23 |
| $X\sim\mathcal{N}(8, 2)$ | 38.79 | 30.12 | 26.82 |
| $X\sim\mathcal{N}(10, 2)$ | 28.05 | 21.05 | 17.46 |
| $X\sim\mathcal{N}(20, 2)$ | 24.73 | 18.72 | 15.89 |

We report $\mathrm{sAP_{3D}}$ for the Car category at IoU = 0.7. We can see that when extra delay noise is first introduced, the model's performance begins to decrease slightly. This is because the inference delay of a small number of samples already exceeds the inter-frame interval of 100 ms. When the average additional delay is 8 ms, the performance degradation is particularly severe compared to an average additional delay of 5 ms. This happens because the average latency reaches 91.45 + 8 = 99.45 ms, causing nearly half of the samples to have inference times that exceed the inter-frame interval; consequently, the model can no longer respond promptly to each frame. In other words, in practice it is important to ensure that the combined data and inference latency stays below the inter-frame interval. These results also underscore the crucial role of real-time computation in streaming perception tasks, whether achieved through better hardware or more optimized algorithms.
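As a rough illustration of this threshold effect, the following sketch (an assumed simulation, not the paper's evaluation code; it treats the second parameter of $\mathcal{N}(\mu, 2)$ as the standard deviation) estimates the fraction of frames whose total latency exceeds the 100 ms inter-frame interval at each noise level:

```python
# Illustrative simulation: fraction of frames whose total latency exceeds
# the 100 ms inter-frame budget when Gaussian extra delay is added to a
# ~91.45 ms inference time. Interpreting the noise parameter as the std
# is an assumption; the qualitative conclusion is unchanged either way.
import numpy as np

rng = np.random.default_rng(0)
BASE_LATENCY_MS = 91.45
FRAME_INTERVAL_MS = 100.0
N_FRAMES = 10_000

for mu in [0, 5, 8, 10, 20]:
    extra = rng.normal(loc=mu, scale=2.0, size=N_FRAMES)
    missed = np.mean(BASE_LATENCY_MS + extra > FRAME_INTERVAL_MS)
    print(f"extra ~ N({mu}, 2): {missed:.1%} of frames miss the frame budget")

# With mu = 8 ms the mean total is ~99.45 ms, so a large share of frames
# spill into the next interval, consistent with the sharp sAP drop above.
```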


Questions


Q1: There are some grammar errors in the manuscript.

A1: Thank you for bringing them to our attention. We have rechecked the manuscript for grammatical issues.


Comment

We apologize for the typographical error in A2 of the rebuttal. The revised version should be: "If this assumption does not hold, fusing historical information is not beneficial for predicting the future."

Comment

Thank you for the response. According to the experimental results with additional delay, the robustness of the proposed method is not very strong. Also, there is no comparison with baselines.

Comment

Thank you for the response. Indeed, delay robustness largely depends on the computational platform. For example, when our method runs on an RTX 3090 (around 60 ms inference time), even with a hardware delay of 30 ms (e.g., due to camera exposure), the streaming accuracy would not significantly decline, because the total latency is still less than the inter-frame interval. As for the baseline, it is represented by the first row of our table, indicating the accuracy without any delay noise.

Official Review
Rating: 7

The work proposes StreamDSGN, a framework for streaming perception, evaluated on the KITTI dataset with sAP (streaming average precision), a metric that takes the model latency into account and is a better fit for evaluating streaming perception than accuracy-only metrics. It is the first work to do 3D object detection in the streaming perception setting. The DSGN++ base detector is used and its performance is improved. The paper introduces three methods to improve the results: (1) using past frames' feature positions to predict the next frame's features, (2) additional supervision, and (3) a larger kernel for a bigger receptive field.

Strengths

  1. The ablation study is exhaustive.

  2. Figures are well done and generally support the understanding (Figure 2 is addressed separately).

  3. The method achieves a better result than the baseline.

Weaknesses

  1. Limitations are discussed only very briefly.

  2. Figure 2 needs a little more explanation. There is more than one GT trajectory; are they from different frames? Are the prediction and GT for the next frame, t+1?

  3. Unclear sentence in the introduction: "It is observed that for moving objects, the ground truth of the next frame (depicted by the red bounding box) consistently precedes their current position." What does it mean for the ground truth to "precede the current position"?

  4. The largest concern: the test setup, with interleaved training/testing frames, does not really provide a separate testing set. All scenes are split into sequences of 4 seconds, which are then used alternately for testing and training. In urban driving scenarios, cars won't move that much in 4 seconds. This leads to a significantly smaller domain gap between test and training than when, for example, training on completely different sequences. It is unclear whether this split is a common protocol for KITTI or was devised for this paper.

Questions

See weaknesses 3 and especially 4.

Limitations

Limitations are discussed.

Author Response

Response to reviewer NjQK

Weaknesses


W1: Limitations are discussed only very briefly.

A1: Our discussion of the limitations and future work is brief due to the page limit. We may add a more detailed discussion in the appendix.


W2: Figure 2 needs a little more explanation. There is more than one GT trajectory; are they from different frames? Are the prediction and GT for the next frame, t+1?

A2: Yes. The ground-truth trajectories (denoted by red dashed arrows) refer to different timesteps/frames, e.g., the two arrows denote the trajectory from $t-1$ to $t$ and from $t$ to $t+1$, respectively, while the predicted trajectory (denoted by the green solid arrow) is from $t$ to $t+1$.


W3: Unclear sentence in the introduction: "It is observed that for moving objects, the ground truth of the next frame (depicted by the red bounding box) consistently precedes their current position." What does it mean for the ground truth to "precede the current position"?

A3: The sentence is indeed unclear, and we plan to rewrite it as follows: "For a moving object with (approximately) constant velocity, its ground-truth position in the next frame at time $t+1$ (depicted by the red bounding box) is likely to differ from its position in the current frame at time $t$, and its movement is predictable from its recent history of positions in frames $t-1$ and $t$."
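For illustration only, the constant-velocity extrapolation implied by this sentence can be written as:

```latex
% Illustration: under constant velocity, the position at t+1 follows from
% the two most recent observed positions.
\[
\hat{p}_{t+1} \;=\; p_{t} + \left(p_{t} - p_{t-1}\right) \;=\; 2p_{t} - p_{t-1}.
\]
```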


W4: The largest concern: the test setup, with interleaved training/testing frames, does not really provide a separate testing set. All scenes are split into sequences of 4 seconds, which are then used alternately for testing and training. In urban driving scenarios, cars won't move that much in 4 seconds. This leads to a significantly smaller domain gap between test and training than when, for example, training on completely different sequences. It is unclear whether this split is a common protocol for KITTI or was devised for this paper.

A4: Thank you for this good question; it will make our experiments more convincing. We added an experiment to compare the domain gap between our split tracking dataset (train:val = 4291:3672) and the widely recognized KITTI Object Detection dataset (train:val = 3712:3769). Specifically, we trained and tested PointPillar and DSGN++ on both datasets and compared $\mathrm{AP_{3D}}$ for the Car category at IoU = 0.7. The experimental results are shown in the table below (we will add this experiment to our manuscript):

| Method | Sensor | Dataset | Easy | Moderate | Hard |
|---|---|---|---|---|---|
| PointPillar | LiDAR | Object Detection | 87.75 | 78.38 | 75.18 |
| PointPillar | LiDAR | our split Tracking | 94.57 | 88.35 | 84.85 |
| DSGN++ | Stereo | Object Detection | 83.63 | 66.41 | 61.38 |
| DSGN++ | Stereo | our split Tracking | 91.79 | 78.35 | 69.79 |

From the table, we can observe that both methods perform better on our split tracking dataset, which indeed confirms that its domain gap is smaller than that of the Object Detection dataset. However, considering that we have 579 additional training samples and that accuracy further decreases under the streaming perception constraints, this difference is reasonable.

Furthermore, please note that, as this is the first work in this area, our focus is not on showing how high the accuracy is, but on demonstrating the effectiveness of our method. All our experiments were conducted on the same data split. In the future, we plan to conduct further validation on larger-scale datasets with higher frame rates and greater domain gaps.


Comment

Thank you for the rebuttal, the (important) clarifications and the additional experiments, especially regarding the domain gap.

Official Review
Rating: 5

The paper presents a stereo-based 3D object detection method designed for streaming perception, where the current frame (and past frames) are used to predict the object bounding boxes in the next frame. The authors adopt a simplified DSGN++ as the backbone and add several components to enhance perception accuracy. First, the authors propose to estimate the flow in feature space from t-1 to t and use it to warp the features to the future frame at t+1. Besides, a motion consistency loss is added to refine the future trajectory. Last, the authors propose a large-kernel backbone to process the BEV feature. The proposed method reaches a latency of about 90 ms and outperforms a baseline (using a Kalman filter) on the KITTI dataset.

Strengths

  1. This paper is well motivated and aims to solve an important problem in real-world applications.

  2. The proposed method yields appealing streaming perception performance on the KITTI dataset.

  3. The way of generating feature flow is interesting and could be used in other autonomous driving applications.

Weaknesses

Though the proposed method shows good performance, some details about the proposed new components are missing. Besides, the proposed method is only tested on one dataset and compared with one baseline, which may not be sufficient. Please refer to the Questions for more details.

Questions

  1. Though the proposed method focuses on stereo-based 3D object detection, the new components proposed in this paper are not tied to the stereo setting and seem to work generally for all BEV-based perception systems. Why not test on more general settings and use more datasets (e.g., the six cameras in nuScenes)?

  2. For the Feature-Flow Fusion, is there any reason why the authors apply warping in feature space instead of using optical flow in the pixel space?

  3. For motion consistency loss, it makes sense to constrain the velocity and acceleration for the new prediction. But given an estimated bounding box, how do you know which object it is to retrieve its past trajectory? What if the object is only contained in the current frame (not appear in the past frames)? And what if the network predicts a wrong object (false positive) during the early training stage? How to calculate the loss then? It would be better to provide more explanations.

  4. In Section 3.3, does $G_t^{pose}$ mean the same thing as $G_t^{box}$?

  5. The proposed method is compared with a baseline using a Kalman filter. But it seems like the Kalman filter is very easy to beat. What if more advanced trajectory prediction methods were used as the baseline? Would the proposed method still show a large performance gain?

Limitations

The authors have adequately addressed the limitations.

Author Response

Response to reviewer uBXC

Questions


Q1: Though the proposed method focuses on stereo-based 3D object detection, the new components proposed in this paper are not tied to the stereo setting and seem to work generally for all BEV-based perception systems. Why not test on more general settings and use more datasets (e.g., the six cameras in nuScenes)?

A1: In theory, our method is indeed applicable to other BEV-based approaches. However, when deployed to other multi-view BEV-based methods, the following challenges may arise:

  • Streaming perception evaluation requires high-frame-rate annotations to obtain accurate results, but these datasets often have low annotation frame rates; e.g., the nuScenes dataset has a 12 Hz image frame rate but only a 2 Hz annotation frequency.

  • In multi-view BEV-based 3D detection, LSS-series methods (e.g., BEVDet, BEVDepth, BEVStereo) have the limitation of generating discrete and sparse BEV representations (as noted in FB-BEV [1]). These sparse BEVs may lead to numerous redundant warping operations in FFF and may cause potential distortions of warped objects. Please compare Figure 2 in [1] with our Figure 8 to observe the difference in BEV features.

  • For methods in the BEVFormer series, such approaches typically have higher latencies (over 400ms). A large latency interval requires a larger spatial search for the FFF, and the motion consistency assumed by the MCL may no longer hold.

These challenges will be included in our discussion of limitations. We will also explore applications related to multi-view methods in the future.


Q2: For the Feature-Flow Fusion, is there any reason why the authors apply warping in feature space instead of using optical flow in the pixel space?

A2: Here are the reasons for this choice:

  • The dataset lacks synchronized optical flow ground truth.
  • Even if optical flow ground truth were available, using an optical flow estimation model would introduce additional time overhead. In contrast, computing flow at an intermediate level does not require an extra feature extraction process.
  • Intermediate-level features include image features (image coordinates) and BEV features (world coordinates). Performing warping on image features inevitably leads to misalignment between the image features and the depth map ground truth. In contrast, warping on BEV features does not affect depth regression and better aligns with object movement in the real world (a generic warping sketch is given after this list).
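For reference, below is a generic sketch of warping a BEV feature map by a per-cell flow field (this is not the paper's FFF implementation; the tensor shapes and the backward-warping convention are assumptions):

```python
# Generic BEV-feature warping by a per-cell flow map using
# torch.nn.functional.grid_sample. Illustration only, not the paper's FFF.
import torch
import torch.nn.functional as F

def warp_bev_by_flow(bev_feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """bev_feat: (B, C, H, W) BEV features at time t.
    flow: (B, 2, H, W) displacement in BEV cells (dx, dy) from t to t+1.
    Returns features resampled at the displaced locations (backward warp)."""
    B, _, H, W = bev_feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=bev_feat.dtype, device=bev_feat.device),
        torch.arange(W, dtype=bev_feat.dtype, device=bev_feat.device),
        indexing="ij",
    )
    # To fill cell (x, y) at t+1, sample location (x - dx, y - dy) at t.
    src_x = xs.unsqueeze(0) - flow[:, 0]
    src_y = ys.unsqueeze(0) - flow[:, 1]
    # Normalize to the [-1, 1] grid expected by grid_sample (align_corners=True).
    grid_x = 2.0 * src_x / (W - 1) - 1.0
    grid_y = 2.0 * src_y / (H - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2), last dim = (x, y)
    return F.grid_sample(bev_feat, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)

# Sanity check: zero flow is an identity warp.
feat = torch.randn(1, 64, 200, 176)
flow = torch.zeros(1, 2, 200, 176)
assert torch.allclose(warp_bev_by_flow(feat, flow), feat, atol=1e-5)
```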

Q3: For motion consistency loss, it makes sense to constrain the velocity and acceleration for the new prediction. But given an estimated bounding box, how do you know which object it is to retrieve its past trajectory? What if the object is only contained in the current frame (not appear in the past frames)? And what if the network predicts a wrong object (false positive) during the early training stage? How to calculate the loss then? It would be better to provide more explanations.

A3: Our manuscript already explains how to retrieve an object's past trajectory in Section 3.3. The initial step in calculating the MCL involves establishing correspondences between bounding boxes across different time steps. We establish the correspondence between $P_{t+1}$ and $G_t$ using an IoU matrix (as in StreamYOLO), and establish the correspondence between $G_{t-2}$, $G_{t-1}$, and $G_t$ using object IDs.
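A minimal sketch of such IoU-based matching is given below (axis-aligned 2D boxes for simplicity; the paper matches rotated BEV boxes, so the box parameterization and threshold here are assumptions):

```python
# Illustrative IoU-matrix matching: each prediction at t+1 is associated
# with the ground-truth box at t of highest IoU above a threshold.
import numpy as np

def iou_matrix(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """pred: (N, 4), gt: (M, 4) boxes as (x1, y1, x2, y2). Returns (N, M) IoU."""
    x1 = np.maximum(pred[:, None, 0], gt[None, :, 0])
    y1 = np.maximum(pred[:, None, 1], gt[None, :, 1])
    x2 = np.minimum(pred[:, None, 2], gt[None, :, 2])
    y2 = np.minimum(pred[:, None, 3], gt[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_p[:, None] + area_g[None, :] - inter + 1e-9)

def match(pred: np.ndarray, gt: np.ndarray, thr: float = 0.5):
    """Return (pred_idx, gt_idx) pairs; unmatched predictions get no MCL term."""
    ious = iou_matrix(pred, gt)
    best = ious.argmax(axis=1)
    keep = ious[np.arange(len(pred)), best] >= thr
    return np.flatnonzero(keep), best[keep]

# Once a prediction is tied to a box in G_t, its earlier poses G_{t-1}, G_{t-2}
# are retrieved via the object's tracking ID in the annotations.
```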

As for objects that appear in only a single frame, since there is no continuous trajectory, we do not calculate their MCL either.

Regarding false positives, they occur when the classification head gives high scores to negative anchors, which is unrelated to the regression head. Our MCL is a regression supervision applied only to positive anchors (as in PointPillar, SECOND, etc.). Please note that whether an anchor is positive or negative is determined by the IoU between the preset anchor and the ground truth bounding box, not by the classification score.


Q4: In Section 3.3, does $G_t^{pose}$ mean the same thing as $G_t^{box}$?

A4: $G_t^{box} = \{x, y, z, l, w, h, \theta\}$ and $G_t^{pose} = \{x, y, z, \theta\}$, where $\{x, y, z\}$ denotes the object's spatial position, $\{l, w, h\}$ denotes its dimensions, and $\theta$ denotes the heading angle. Since the dimensions of the object remain constant, our MCL only needs to compute changes in position and orientation.


Q5: The proposed method is compared with a baseline using a Kalman filter. But it seems like the Kalman filter is very easy to beat. What if more advanced trajectory prediction methods were used as the baseline? Would the proposed method still show a large performance gain?

A5: Previous approaches to streaming perception can be divided into two categories (see Ref [52]): (a) velocity-based updating (non-end-to-end), where a Kalman filter is used to associate multi-frame detection results and the future state is predicted by a constant-velocity motion model; e.g., Ref [30] proposed a meta-detector named Streamer that can be combined with any object detector. (b) Learning-based forecasting (end-to-end), where the future state is directly estimated by the detector itself, e.g., StreamYOLO from Ref [59].

StreamYOLO has shown the advantages of end-to-end methods. Therefore, even when combined with more advanced trajectory prediction methods, non-end-to-end methods may struggle to achieve our level of accuracy. We show comparisons between both types of methods in our manuscript: the comparison with non-end-to-end methods is detailed in Table 1, while the comparison with end-to-end methods is shown in Table 2, settings c and g. We plan to consolidate both comparisons into a single table to avoid confusion.


Reference

[1] Li Z, Yu Z, Wang W, et al. FB-BEV: BEV representation from forward-backward view transformations. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023: 6919-6928.

Comment

I would like to thank the authors for their response. Most of my concerns have been resolved so I increased my score. The extra explanations are very helpful. I encourage the authors to add them to the revised version.

Author Response

Summary


Multi-View Setup Limitations

Some reviewers have questioned why this work did not conduct experiments on multi-view setups. One reason is the lack of high-frame-rate datasets. Additionally, existing multi-view methods typically have high latency or sparse BEV features, making it challenging to directly apply our components to these methods. This will be a direction for our future exploration.


Domain Gap Concern

Some reviewers raised concerns about the domain gap in our dataset split. We have supplemented our experiments by comparing our results with those on the widely recognized KITTI Object Detection dataset. The results show that our domain gap is indeed smaller. However, considering that our dataset has more training samples and that detection accuracy further decreases under the streaming evaluation constraints, we consider this domain gap acceptable.


Comparison on Advanced Trajectory Tracking Algorithms

Some reviewers have wondered what would happen if more advanced trajectory tracking algorithms were used for comparison. We compared with the Kalman filter because it is a benchmark method (non-end-to-end). StreamYOLO has already demonstrated the advantages of directly predicting future states in an end-to-end manner. Therefore, even when combined with more advanced trajectory prediction methods, non-end-to-end methods may struggle to achieve our level of accuracy. Also, our manuscript includes comparison with SOTA end-to-end methods (Table 2, settings c and g).


Novelty and Impact of t-1 Fusion

Some reviewers raised concerns about the novelty of our work and noted that the main accuracy improvement comes from the fusion of the $t-1$ frame. To the best of our knowledge, we are the first to apply streaming perception to 3D object detection. The fusion of the $t-1$ frame is a basic aspect of our framework, so it is not surprising that it contributes the biggest performance improvement. The three modules in this manuscript are our further attempts to enhance perception accuracy on top of this baseline.


Latency of Each Module and Hardware Impact

Some reviewers have raised questions about the latency of each module and the impact of hardware latency. We have added experiments to address these issues. For the latency of each module, our experiments show that the image feature extractor has the highest latency at 73.62 ms. Further optimization efforts can focus on this aspect. Regarding the impact of hardware latency, when the combined hardware and inference latency is less than the inter-frame interval, the accuracy only slightly decreases; however, when the total latency reaches or exceeds the inter-frame interval, the accuracy drops significantly. This further underscores the importance of real-time performance in streaming perception tasks.

Final Decision

All reviewers recommended acceptance, though three are borderline. This AC sees no reason to override the collective recommendation of the reviewers. The main contribution of this paper, 3D object detection for streaming perception, together with the good results, makes it a worthy paper.